From 97edb6804b84aa73c16181b05819e499cbf747dc Mon Sep 17 00:00:00 2001
From: Robin Lovelace
Date: Thu, 26 Jan 2023 22:51:52 +0000
Subject: [PATCH] Shorten and improve sentences mentioning "big data" (#911)

---
 10-gis.Rmd       |  2 +-
 16-synthesis.Rmd | 16 ++++++----------
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/10-gis.Rmd b/10-gis.Rmd
index ee6dd0749..ddbb97a73 100644
--- a/10-gis.Rmd
+++ b/10-gis.Rmd
@@ -858,7 +858,7 @@ But the same is true for the lightweight SQLite/SpatiaLite database engine and G
 
 If your datasets are too big for PostgreSQL/PostGIS and you require massive spatial data management and query performance, it may be worth exploring large-scale geographic querying on distributed computing systems.
 Such systems are outside the scope of this book but it worth mentioning that open source software providing this functionality exists.
-Prominent projects in this space include [GeoMesa](http://www.geomesa.org/) and [Apache Sedona](https://sedona.apache.org/), formerly known as GeoSpark [@huang_geospark_2017], which has and R interface provided by the [**apache.sedona**](https://cran.r-project.org/package=apache.sedona) package.
+Prominent projects in this space include [GeoMesa](http://www.geomesa.org/) and [Apache Sedona](https://sedona.apache.org/). The [**apache.sedona**](https://cran.r-project.org/package=apache.sedona) package provides an interface to the latter.
 
 ## Bridges to cloud technologies and services {#cloud}
 
diff --git a/16-synthesis.Rmd b/16-synthesis.Rmd
index 5c9014ef1..bcf59195e 100644
--- a/16-synthesis.Rmd
+++ b/16-synthesis.Rmd
@@ -127,16 +127,12 @@ Instead of covering spatial statistical modeling and inference techniques, we fo
 Again, the reason was that there are already excellent resources on these topics, especially with ecological use cases, including @zuur_mixed_2009, @zuur_beginners_2017 and freely available teaching material and code on *Geostatistics & Open-source Statistical Computing* by David Rossiter, hosted at [css.cornell.edu/faculty/dgr2](http://www.css.cornell.edu/faculty/dgr2/teach/) and the [*R for Geographic Data Science*](https://sdesabbata.github.io/r-for-geographic-data-science/) project by [Stefano De Sabbata](https://sdesabbata.github.io/) [at the University of Leicester](https://le.ac.uk/people/stefano-de-sabbata) for an introduction to R\index{R} for geographic data science\index{data science}.
 There are also excellent resources on spatial statistics\index{spatial!statistics} using Bayesian modeling, a powerful framework for modeling and uncertainty estimation [@blangiardo_spatial_2015;@krainski_advanced_2018].
 
-Finally, we have largely omitted big data\index{big data} analytics.
-This might seem surprising since especially geographic data can become big really fast.
-But the prerequisite for doing big data analytics is to know how to solve a problem on a small dataset.
-Once you have learned that, you can apply the exact same techniques on big data questions, though of course you need to expand your toolbox.
-The first thing to learn is to handle geographic data queries.
-This is because big data\index{big data} analytics often boil down to extracting a small amount of data from a database for a specific statistical analysis.
-For this, we have provided an introduction to spatial databases\index{spatial database} and how to use a GIS\index{GIS} from within R in Chapter \@ref(gis).
-If you really have to do the analysis on a big or even the complete dataset, hopefully, the problem you are trying to solve is embarrassingly parallel.
-For this, you need to learn a system that is able to do this parallelization efficiently such as [Apache Sedona](https://sedona.apache.org/), as mentioned in Section \@ref(postgis).
-Regardless of dataset size, the techniques and concepts you have used on small datasets will be useful\index{big data} question, the only difference being the extra considterations when working in a big data setting.
+We have largely omitted geocomputation on 'big data'\index{big data}, by which we mean datasets that do not fit on consumer hardware or cannot realistically be processed on a single CPU.
+This decision is justified by the fact that the majority of geographic datasets needed for common research or policy applications *do* fit on consumer hardware, even if that may mean increasing the amount of RAM on your computer (or temporarily 'renting' compute power on platforms such as [GitHub Codespaces](https://github.com/codespaces/new?hide_repo_select=true&ref=main&repo=84222786&machine=basicLinux32gb&devcontainer_path=.devcontainer.json&location=WestEurope)), and by the fact that learning to solve problems on small datasets is a prerequisite to solving them on huge datasets.
+Analysis of 'big data' often involves extracting a small amount of data from a database for a specific statistical analysis.
+Spatial databases\index{spatial database}, covered in Chapter \@ref(gis), can help with the analysis of datasets that do not fit in memory.
+'Earth observation cloud back-ends' can be accessed from R with the **openeo** package, as described on the [openeo.org](https://openeo.org/) website.
+We omitted detailed coverage of systems for geographic analysis of big data, such as [Apache Sedona](https://sedona.apache.org/), because the hardware and time costs of setting up such systems are high relative to their niche use cases.
 
 ## Getting help? {#questions}
 
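The 'extract a small amount of data from a database' workflow described in the new 16-synthesis.Rmd text can be sketched with **sf** and **RPostgres**. This is a minimal sketch, not code from the book: the connection details and the `highways` table are hypothetical placeholders, and it assumes a running PostGIS instance.

```r
library(sf)        # reads query results as spatial data frames
library(DBI)       # generic database interface
library(RPostgres) # PostgreSQL driver

# Hypothetical connection details: adjust to your own PostGIS instance
con = dbConnect(RPostgres::Postgres(),
                host = "localhost", dbname = "geodb",
                user = "user", password = "pass")

# Push the spatial filter to the database, so only a small subset of a
# potentially huge table is returned to R
query = "SELECT * FROM highways
         WHERE ST_Intersects(geom, ST_MakeEnvelope(-1.6, 53.7, -1.4, 53.9, 4326))"
small_subset = st_read(con, query = query)

dbDisconnect(con)
```

Because the filter runs inside the database, only the rows intersecting the envelope ever reach R, which is the point the paragraph makes about datasets that do not fit in memory.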
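The sentence pointing to **openeo** can be illustrated in a similar spirit. A minimal sketch of connecting to an Earth observation cloud back-end; the host URL is an example, and available back-ends are listed at [openeo.org](https://openeo.org/):

```r
library(openeo)

# Connect to an openEO back-end (example URL; see https://openeo.org/)
con = connect(host = "https://openeo.cloud")

# Discover which Earth observation collections the back-end provides
collections = list_collections()
```

Further steps, such as logging in and building process graphs, are back-end specific and documented on the openeo.org website.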
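On the 10-gis.Rmd change: **apache.sedona** builds on **sparklyr**, and attaching it before creating a Spark connection is intended to register Sedona's spatial SQL functions with the session. A minimal local-mode sketch, assuming a local Spark installation (`ST_Point` and `ST_AsText` are part of Sedona's SQL API, not base Spark):

```r
library(sparklyr)
library(apache.sedona) # attach before connecting so Sedona is initialized

# Local-mode connection for experimentation; real deployments use a cluster
sc = spark_connect(master = "local")

# Sedona registers spatial SQL functions such as ST_Point with the session
DBI::dbGetQuery(sc, "SELECT ST_AsText(ST_Point(1.0, 2.0)) AS geom")

spark_disconnect(sc)
```

Even this toy example requires a working Spark installation, which illustrates the setup costs that the new 16-synthesis.Rmd text gives as the reason for omitting detailed coverage of such systems.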