Some computations will not only become very slow, but even impossible for large datasets – for example, because the data no longer fit into working memory. But the good news is that it is often totally sufficient to work on a sample, for instance to compute summary statistics or estimate a regression model. At least during code development, this can be very useful. Another option is to divide your data into multiple parts, do your computations on each part separately, and recombine the results (e.g., by averaging regression coefficients). Sometimes you can even execute those computations in parallel – even if working memory was not sufficient to do it on the whole dataset. This sounds counterintuitive, but the reason is the following: many methods (e.g., regression analysis) work with matrices. These often grow quadratically with the number of observations, and so do the computational costs and the required working memory. Therefore, doing it with half the sample size requires only roughly a quarter of the resources, not half.

Especially for data handling, dplyr is much more elegant than base R, and often faster. But there is an even faster alternative: the data.table package.
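As a minimal sketch of the sampling idea, here is how you might develop code on a small random sample before running it on the full data. The data frame `big_df` is a hypothetical placeholder; `slice_sample()` is from dplyr:

```r
library(dplyr)

# Hypothetical large dataset (stands in for your real data)
big_df <- data.frame(x = rnorm(1e6))
big_df$y <- 2 * big_df$x + rnorm(1e6)

# During development, work on a 1% random sample
dev_df <- big_df %>% slice_sample(prop = 0.01)

# Summary statistics and a regression estimated on the sample
summary(dev_df$x)
fit <- lm(y ~ x, data = dev_df)
coef(fit)
```

Once the code works on the sample, you can rerun the same pipeline on `big_df`.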
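The divide-and-recombine approach can be sketched as follows, assuming a simple linear regression whose coefficients are averaged across chunks (the data and chunk count are illustrative):

```r
# Simulated data standing in for a large dataset
set.seed(1)
df <- data.frame(x = rnorm(1e5))
df$y <- 2 * df$x + rnorm(1e5)

# Split into 10 parts, fit the model on each part separately
chunks <- split(df, rep(1:10, length.out = nrow(df)))
coef_list <- lapply(chunks, function(chunk) coef(lm(y ~ x, data = chunk)))

# Recombine by averaging the coefficient vectors
avg_coef <- Reduce(`+`, coef_list) / length(coef_list)
avg_coef
```

On Unix-like systems, replacing `lapply()` with `parallel::mclapply()` would run the per-chunk fits in parallel, since each chunk only needs a fraction of the memory of the full dataset.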
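To make the comparison concrete, here is the same grouped aggregation written once in dplyr and once in data.table syntax (the data frame and grouping variable are made up for illustration):

```r
library(dplyr)
library(data.table)

df <- data.frame(g = rep(c("a", "b"), each = 5e5), y = rnorm(1e6))

# dplyr: readable verb chain
df %>%
  group_by(g) %>%
  summarise(mean_y = mean(y))

# data.table: the compact dt[i, j, by] syntax, typically faster on large data
dt <- as.data.table(df)
dt[, .(mean_y = mean(y)), by = g]
```

data.table also provides `fread()`, a very fast reader for large delimited files.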