Quant Interview "Real Questions" Series: Issue 1

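The Quantitative Investment and Machine Learning WeChat official account (量化投资与机器学习, QIML) is a leading independent media outlet focused on quantitative investing, hedge funds, fintech, artificial intelligence, big data, and related fields. It has more than 300,000 followers from mutual funds, private funds, brokerages, futures firms, banks, insurers, universities, and other sectors, and has been named "Author of the Year" by the Tencent Cloud+ Community for two consecutive years.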

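In 2022, the Quantitative Investment and Machine Learning official account launched yet another brand-new series.

QIML has gathered real interview questions from top global hedge funds and major internet companies. We hope it gives readers a different kind of job-hunting and learning experience!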

Issue 1

▌Asked by: AQR

▌Difficulty: Easy

Question

Say that you are running a multiple linear regression and that you have reason to believe that several of the predictors are correlated. How will the results of the regression be affected if several are indeed correlated? How would you deal with this problem?

Answer

There will be two primary problems when running a regression if several of the predictor variables are correlated. The first is that the coefficient estimates and their signs can vary dramatically depending on which particular variables you include in the model. Certain coefficients may even have confidence intervals that include 0 (meaning it is difficult to tell whether an increase in that X value is associated with an increase or a decrease in Y), and hence the results will not be statistically significant. The second is that the resulting p-values will be misleading. For instance, an important variable might have a high p-value and so be deemed statistically insignificant even though it is actually important. It is as if the effect of the correlated features were “split” between them, leading to uncertainty about which features are actually relevant to the model.

You can deal with this problem by either removing or combining the correlated predictors. To remove one of the predictors effectively, it is best to understand the cause of the correlation (i.e., did you include extraneous predictors such as X and 2X, or is there some latent variable underlying two or more of the ones you have included?). To combine predictors, it is possible to include interaction terms (the product of the two that are correlated). Additionally, you could (1) center the data and (2) try to obtain a larger sample, thereby giving you narrower confidence intervals. Lastly, you can apply regularization methods (such as ridge regression).
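To make the X and 2X case concrete, here is a minimal sketch under stated assumptions: a no-intercept model with exactly two predictors, solved by hand with Cramer's rule. The function names, the toy data, and the penalty `lam=1.0` are illustrative choices, not part of the original answer. With a perfectly collinear pair, the least-squares system is singular, while the ridge penalty makes it solvable and spreads the true effect of 3 across both columns (roughly 0.60 and 1.19 here).

```python
def gram_and_moments(x1, x2, y):
    """Return the entries of X^T X and X^T y for a two-predictor, no-intercept model."""
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    m1 = sum(a * t for a, t in zip(x1, y))
    m2 = sum(b * t for b, t in zip(x2, y))
    return s11, s12, s22, m1, m2


def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X^T X + lam * I) b = X^T y for two predictors via Cramer's rule."""
    s11, s12, s22, m1, m2 = gram_and_moments(x1, x2, y)
    a11, a22 = s11 + lam, s22 + lam
    det = a11 * a22 - s12 * s12
    if det == 0:
        raise ValueError("singular system: no unique coefficient vector")
    b1 = (m1 * a22 - s12 * m2) / det
    b2 = (a11 * m2 - s12 * m1) / det
    return b1, b2


x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [2.0 * v for v in x1]  # perfectly collinear with x1 (the "X and 2X" case)
y = [3.0 * v for v in x1]   # true relationship: y = 3 * x1

s11, s12, s22, _, _ = gram_and_moments(x1, x2, y)
ols_det = s11 * s22 - s12 * s12  # zero, so plain least squares has no unique solution

b1, b2 = ridge_two_predictors(x1, x2, y, lam=1.0)
combined_effect = b1 + 2.0 * b2  # effect of a one-unit change in x1 through both columns
```

Neither coefficient alone recovers the true slope, but their combined effect stays close to 3, which is one way to see the "split" described above.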

---

▌Asked by: Point72

▌Difficulty: Easy

Question

Describe the motivation behind random forests. What are two ways in which they improve upon individual decision trees?

Answer

Random forests are used because individual decision trees are usually prone to overfitting. A random forest builds multiple decision trees and aggregates their decisions, and it can be used for either classification or regression. There are a few main ways in which random forests allow for stronger out-of-sample prediction than individual decision trees.

* As in other ensemble models, training a large set of trees on resamples of the data (bootstrap aggregation) leads to a model that yields more consistent results. More specifically, and in contrast to a single decision tree, it introduces diversity in the training data for each tree and so contributes to better results in terms of the bias-variance trade-off (particularly with respect to variance).

* Using only m < p features at each split helps to de-correlate the decision trees, thereby avoiding having very important features always appearing at the first splits of the trees (which would happen on standalone trees due to the nature of information gain).

* They’re fairly easy to implement and fast to run.

* They can produce very interpretable feature-importance values, thereby improving model understandability and feature selection.

The first two bullet points are the main ways random forests improve upon single decision trees.
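The two sources of randomness in the first two bullets can be sketched as follows. This is only an illustration of the sampling steps, not a full random forest; the helper names and the sizes (10 rows, 9 features, m = 3, 5 trees) are assumptions made for the example.

```python
import random


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_features(n_features, m, rng):
    """Pick m distinct feature indices to consider at a single split."""
    return sorted(rng.sample(range(n_features), m))


rng = random.Random(0)            # fixed seed so the sketch is reproducible
n_rows, n_features, m = 10, 9, 3  # m < p

# Each tree gets its own resampled view of the rows ...
rows_per_tree = [bootstrap_indices(n_rows, rng) for _ in range(5)]
# ... and each split inside a tree sees only a random subset of the columns.
features_at_split = candidate_features(n_features, m, rng)
```

Because rows are drawn with replacement, a given tree typically sees some rows more than once and others not at all, which is what gives each tree a different view of the data.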

---

▌Asked by: Two Sigma

▌Difficulty: Easy

Question

Say you were running a linear regression for a dataset but you accidentally duplicated every data point. What happens to your beta coefficient?

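Answer

The coefficient remains unchanged. Recall the least-squares estimator β = (X^T X)^(-1) X^T y. Stacking the data on top of itself turns X^T X into 2 X^T X and X^T y into 2 X^T y, so the estimator becomes (2 X^T X)^(-1) (2 X^T y) = (X^T X)^(-1) X^T y: the factors of 2 cancel.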
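A quick numerical check of this claim, as a minimal sketch: it uses the closed-form solution for simple (one-predictor) linear regression and made-up data, both of which are assumptions for illustration only.

```python
def ols_slope_intercept(x, y):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8]

original = ols_slope_intercept(x, y)
duplicated = ols_slope_intercept(x + x, y + y)  # every data point appears twice
```

Both calls return the same slope and intercept (up to floating-point rounding), since duplication doubles the numerator and the denominator of the slope alike.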

---

▌Asked by: Robinhood

▌Difficulty: Easy

Question

Say you are building a binary classifier for an unbalanced dataset (where one class is much rarer than the other, say 1% and 99%, respectively). How do you handle this situation?

Answer

Unbalanced classes can be dealt with in several ways.

First, you want to check whether you can get more data or not. While in many scenarios, data may be expensive or difficult to acquire, it’s important to not overlook this approach, and at least mention it to your interviewer.

Next, make sure you’re looking at appropriate metrics. For example, accuracy is not an appropriate metric when classes are imbalanced, because a model that always predicts the abundant class would score 99% in the example above; instead, you want to look at precision, recall, F1 score, and the ROC curve.
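The following sketch shows why accuracy misleads on a 1%/99% split. The metric function and the "always negative" classifier are illustrative assumptions; ratios with an empty denominator are defined as 0.0 here, which is a convention chosen for the example.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for binary labels (1 = rare class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Define the ratios as 0.0 when their denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# 1,000 examples with a 1% positive rate, as in the question.
y_true = [1] * 10 + [0] * 990
always_negative = [0] * 1000  # a "classifier" that never predicts the rare class

accuracy, precision, recall, f1 = classification_metrics(y_true, always_negative)
```

Here accuracy is 0.99 while recall and F1 are both 0, so the model looks excellent by one metric and useless by the others.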

Then, you can resample the training set by either oversampling the rare samples or undersampling the abundant samples; both can be accomplished via bootstrapping. These approaches are easy and quick to run, so they should be good starting points. Note, if the event is inherently rare, then oversampling may not be necessary, and you should focus more on the evaluation function.
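Random oversampling of the rare class can be sketched as below. The function name and the toy data are hypothetical; this version simply draws rare-class rows with replacement until the two classes are the same size, and it is meant to be applied to the training set only.

```python
import random


def random_oversample(rows, labels, rare_label, rng):
    """Resample rare-class rows with replacement until both classes are the same size."""
    rare = [r for r, l in zip(rows, labels) if l == rare_label]
    common = [(r, l) for r, l in zip(rows, labels) if l != rare_label]
    n_extra = max(len(common) - len(rare), 0)
    extra = [rng.choice(rare) for _ in range(n_extra)]
    new_rows = [r for r, _ in common] + rare + extra
    new_labels = [l for _, l in common] + [rare_label] * (len(rare) + len(extra))
    return new_rows, new_labels


rng = random.Random(0)             # fixed seed for reproducibility
rows = list(range(100))            # stand-ins for feature vectors
labels = [1] * 5 + [0] * 95        # 5 rare examples, 95 abundant ones

balanced_rows, balanced_labels = random_oversample(rows, labels, rare_label=1, rng=rng)
```

Undersampling is the mirror image: keep all rare rows and draw a subset of the abundant rows instead.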

Additionally, you could try generating synthetic examples. There are several algorithms for doing so; the most popular is SMOTE (synthetic minority oversampling technique), which creates synthetic samples of the rare class rather than pure copies. It does this by selecting a rare-class instance and one of its nearest rare-class neighbors, then shifting the instance's attributes toward the neighbor by a random fraction of the difference between the two.
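The interpolation step at the heart of SMOTE can be sketched in a few lines. This is a simplified illustration, not the full algorithm: the nearest-neighbor search is skipped, and `rare_neighbor` is simply assumed to be one of the point's nearest rare-class neighbors.

```python
import random


def smote_like_sample(point, neighbor, rng):
    """Create one synthetic point on the segment between a rare-class point and a neighbor."""
    gap = rng.random()  # random fraction in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]


rng = random.Random(42)
rare_point = [1.0, 5.0]
rare_neighbor = [3.0, 4.0]  # assumed to be one of rare_point's nearest rare-class neighbors

synthetic = smote_like_sample(rare_point, rare_neighbor, rng)
```

Every coordinate of the synthetic point lies between the corresponding coordinates of the two originals, so the new example stays inside the region the rare class already occupies.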

Another way is to resample classes by running ensemble models with different ratios of the classes, or by running an ensemble model using all samples of the rare class and a differing amount of the abundant class. Note that some models, such as logistic regression, are able to handle unbalanced classes relatively well in a standalone manner. You can also adjust the probability threshold to something besides 0.5 for classifying the unbalanced outcome.
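Adjusting the probability threshold can be illustrated with a small sketch. The scores below are hypothetical predicted probabilities for the rare class, and the lowered threshold of 0.25 is an arbitrary choice for the example.

```python
def classify(scores, threshold):
    """Label an example 1 (rare class) when its predicted probability meets the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def recall(y_true, y_pred):
    """Fraction of true rare-class examples that were flagged."""
    flagged = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(flagged) / len(flagged)


# Hypothetical predicted probabilities for the rare class.
scores = [0.05, 0.10, 0.30, 0.45, 0.02, 0.60, 0.15, 0.08]
y_true = [0, 0, 1, 1, 0, 1, 0, 0]

recall_default = recall(y_true, classify(scores, 0.5))   # only the 0.60 example is caught
recall_lowered = recall(y_true, classify(scores, 0.25))  # all three positives are caught
```

In practice a lower threshold usually trades some precision for the extra recall, so the threshold is best chosen on a validation set against the metric you care about.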

Lastly, you can design your own cost function that penalizes wrong classifications of the rare class more than wrong classifications of the abundant class. This is useful if you have to use a particular kind of model and you’re unable to resample. However, it can be complex to set up the penalty matrix, especially with many classes.
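One possible form of such a cost function is a class-weighted log loss, sketched below. The function name and the 99:1 weighting are assumptions for illustration; the weights would normally be tuned or derived from the actual cost of each kind of error.

```python
import math


def weighted_log_loss(y_true, p_pred, rare_weight, common_weight=1.0):
    """Average log loss where errors on the rare class (label 1) carry a larger weight."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, 1e-12), 1 - 1e-12)  # clip to keep log() finite
        if t == 1:
            total += -rare_weight * math.log(p)
        else:
            total += -common_weight * math.log(1 - p)
    return total / len(y_true)


# The same confident mistake on each class, with a hypothetical 99:1 weighting.
missed_rare = weighted_log_loss([1], [0.1], rare_weight=99.0)
missed_common = weighted_log_loss([0], [0.9], rare_weight=99.0)
```

With these weights, missing a rare example costs about 99 times as much as an equally confident mistake on an abundant example.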

---

▌Asked by: Facebook

▌Difficulty: Easy

Question

When performing K-means clustering, how do you choose k?

Answer

The elbow method is the best-known method for choosing k in k-means clustering. The intuition behind this technique is that the first few clusters will explain a lot of the variation in the data, but past a certain point, the amount of information added by each additional cluster diminishes. Looking at a graph of explained variation (on the y-axis) versus the number of clusters (k), there should be a sharp change in the y-axis at some level of k. For example, in the graph that follows, we see a dropoff at approximately k = 6.

Note that the explained variation is quantified by the within-cluster sum of squared errors. To calculate this error metric, we sum, over every cluster, the squared Euclidean distances between each point and its cluster centroid. A caveat to keep in mind: the assumption of a sharp drop in variation may not hold; the y-axis value may simply keep decreasing slowly (i.e., there is no significant drop).
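The error metric can be computed as in the sketch below. The cluster labels are supplied by hand here rather than produced by running k-means, and the four points are made up for illustration.

```python
def within_cluster_ss(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in members for d in range(dim))
    return total


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
wcss_k1 = within_cluster_ss(points, [0, 0, 0, 0])  # everything in one cluster
wcss_k2 = within_cluster_ss(points, [0, 0, 1, 1])  # left pair vs. right pair
```

Going from one cluster to two drops the value from 111 to 10. In an elbow plot, this quantity is computed for each candidate k and the k after which the drops become small is chosen.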

Another popular method for determining k in k-means clustering is the silhouette method, which measures how similar a point is to its own cluster compared to other clusters. Concretely, for each point it looks at:

(x - y) / max(x, y)

where x is the mean distance to the examples of the nearest cluster, and y is the mean distance to the other examples in the same cluster. The coefficient varies between -1 and 1 for any given point. A value of 1 implies that the point is in the “right” cluster, and a value of -1 implies the opposite. By plotting the mean score on the y-axis versus k, we can get an idea of the optimal number of clusters based on this metric. Note that the silhouette score is more computationally intensive to calculate for all points than the metric used in the elbow method.
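A minimal sketch of the per-point score, assuming the usual silhouette coefficient (x - y) / max(x, y) with x and y as defined above. The helper names and the toy points are illustrative, and the nearest other cluster is given directly instead of being searched for.

```python
import math


def mean_distance(point, others):
    """Mean Euclidean distance from point to each element of others."""
    total = 0.0
    for other in others:
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(point, other)))
    return total / len(others)


def silhouette(point, same_cluster_others, nearest_cluster):
    """Silhouette coefficient (x - y) / max(x, y) for a single point."""
    y = mean_distance(point, same_cluster_others)  # cohesion within the point's own cluster
    x = mean_distance(point, nearest_cluster)      # separation from the nearest other cluster
    return (x - y) / max(x, y)


point = (0.0, 0.0)
own_cluster = [(0.0, 1.0), (1.0, 0.0)]      # mean distance y = 1.0
nearest_cluster = [(3.0, 4.0), (6.0, 8.0)]  # mean distance x = 7.5

score = silhouette(point, own_cluster, nearest_cluster)
```

The score here is 6.5 / 7.5, about 0.87, indicating a point that sits well inside its own cluster. Averaging this value over all points gives the number plotted against k.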

Taking a step back, while both the elbow and silhouette methods serve their purpose, sometimes it helps to lean on your business intuition when choosing the number of clusters. For example, if you are clustering patients or customer groups, stakeholders and subject matter experts should have a hunch concerning how many groups they expect to see in the data. Additionally, you can visualize the features for the different groups and assess whether they are indeed behaving similarly. There is no perfect method for picking k, because if there were, it would be a supervised problem and not an unsupervised one.

---

▌Asked by: PWC

▌Difficulty: Easy

Question

Compare and contrast gradient boosting and random forests.

Answer

The first main difference is that, in gradient boosting, trees are built one at a time, such that successive weak learners learn from the mistakes of preceding weak learners. In random forests, the trees are built independently at the same time.

The second difference is in the output: gradient boosting combines the results of the weak learners with each successive iteration, whereas, in random forests, the trees are combined at the end (through either averaging or majority vote).

Because of these structural differences, gradient boosting is often more prone to overfitting than random forests, due to its focus on mistakes over training iterations and the lack of independence in tree building. Additionally, gradient boosting hyper-parameters are harder to tune than those of random forests. Lastly, gradient boosting may take longer to train than random forests because its trees are built sequentially, whereas the trees of a random forest can be built in parallel.
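The sequential nature of gradient boosting can be sketched as follows, assuming squared-error loss (so that each new learner is fit to the current residuals) and one-split regression stumps as the weak learners. The data, the learning rate of 0.5, and the three rounds are illustrative choices. A random forest would instead fit every tree to `y` itself on a resampled dataset and average the trees at the end.

```python
def fit_stump(x, target):
    """Fit a one-split regression stump to target by least squares; return a predictor."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for v, t in zip(x, target) if v <= threshold]
        right = [t for v, t in zip(x, target) if v > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - left_mean) ** 2 for t in left) + sum((t - right_mean) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda v: left_mean if v <= threshold else right_mean


def mse(y, pred):
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
learning_rate = 0.5

# Boosting: start from the mean, then fit each new stump to the current residuals.
pred = [sum(y) / len(y)] * len(y)
mse_start = mse(y, pred)
for _ in range(3):
    residuals = [a - b for a, b in zip(y, pred)]
    stump = fit_stump(x, residuals)
    pred = [p + learning_rate * stump(v) for p, v in zip(pred, x)]
mse_boosted = mse(y, pred)
```

Each round depends on the predictions left by the previous one, which is why the loop cannot be parallelized across trees; on this toy data the training error falls from 4.0 to 0.0625 after three rounds.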

---
