Shorten and improve sentences mentioning "big data" (#911)
Robinlovelace committed Jan 26, 2023
1 parent 3013dc5 commit 97edb68
Showing 2 changed files with 7 additions and 11 deletions.
2 changes: 1 addition & 1 deletion 10-gis.Rmd
@@ -858,7 +858,7 @@ But the same is true for the lightweight SQLite/SpatiaLite database engine and G

If your datasets are too big for PostgreSQL/PostGIS and you require massive spatial data management and query performance, it may be worth exploring large-scale geographic querying on distributed computing systems.
Such systems are outside the scope of this book, but it is worth mentioning that open source software providing this functionality exists.
-Prominent projects in this space include [GeoMesa](http://www.geomesa.org/) and [Apache Sedona](https://sedona.apache.org/), formerly known as GeoSpark [@huang_geospark_2017], which has and R interface provided by the [**apache.sedona**](https://cran.r-project.org/package=apache.sedona) package.
+Prominent projects in this space include [GeoMesa](http://www.geomesa.org/) and [Apache Sedona](https://sedona.apache.org/). The [**apache.sedona**](https://cran.r-project.org/package=apache.sedona) package provides an interface to the latter.
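
A minimal sketch of how a Sedona-backed query can be run from R, assuming a local Spark installation alongside the **sparklyr** and **apache.sedona** packages (the point query is purely illustrative):

```r
# Minimal sketch: run spatial SQL on Spark via Apache Sedona.
# Assumes sparklyr and apache.sedona are installed and a local
# Spark installation exists; master = "local" is for testing only.
library(sparklyr)
library(apache.sedona) # makes Sedona's ST_* functions available to Spark
sc = spark_connect(master = "local")
# The query executes in the Spark engine, not in the R session:
pts = sdf_sql(sc, "SELECT ST_Point(1.0, 2.0) AS geom")
pts
spark_disconnect(sc)
```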

## Bridges to cloud technologies and services {#cloud}

16 changes: 6 additions & 10 deletions 16-synthesis.Rmd
@@ -127,16 +127,12 @@ Instead of covering spatial statistical modeling and inference techniques, we fo
Again, the reason was that there are already excellent resources on these topics, especially with ecological use cases, including @zuur_mixed_2009 and @zuur_beginners_2017, the freely available teaching material and code on *Geostatistics & Open-source Statistical Computing* by David Rossiter, hosted at [css.cornell.edu/faculty/dgr2](http://www.css.cornell.edu/faculty/dgr2/teach/), and the [*R for Geographic Data Science*](https://sdesabbata.github.io/r-for-geographic-data-science/) project by [Stefano De Sabbata](https://sdesabbata.github.io/) [at the University of Leicester](https://le.ac.uk/people/stefano-de-sabbata), which provides an introduction to R\index{R} for geographic data science\index{data science}.
There are also excellent resources on spatial statistics\index{spatial!statistics} using Bayesian modeling, a powerful framework for modeling and uncertainty estimation [@blangiardo_spatial_2015;@krainski_advanced_2018].

-Finally, we have largely omitted big data\index{big data} analytics.
-This might seem surprising since especially geographic data can become big really fast.
-But the prerequisite for doing big data analytics is to know how to solve a problem on a small dataset.
-Once you have learned that, you can apply the exact same techniques on big data questions, though of course you need to expand your toolbox.
-The first thing to learn is to handle geographic data queries.
-This is because big data\index{big data} analytics often boil down to extracting a small amount of data from a database for a specific statistical analysis.
-For this, we have provided an introduction to spatial databases\index{spatial database} and how to use a GIS\index{GIS} from within R in Chapter \@ref(gis).
-If you really have to do the analysis on a big or even the complete dataset, hopefully, the problem you are trying to solve is embarrassingly parallel.
-For this, you need to learn a system that is able to do this parallelization efficiently such as [Apache Sedona](https://sedona.apache.org/), as mentioned in Section \@ref(postgis).
-Regardless of dataset size, the techniques and concepts you have used on small datasets will be useful\index{big data} question, the only difference being the extra considterations when working in a big data setting.
+We have largely omitted geocomputation on 'big data'\index{big data}, by which we mean datasets that do not fit on consumer hardware or which cannot realistically be processed on a single CPU.
+This decision is justified by two facts: the majority of geographic datasets needed for common research or policy applications *do* fit on consumer hardware, even if that may mean increasing the amount of RAM on your computer (or temporarily 'renting' compute power on platforms such as [GitHub Codespaces](https://github.com/codespaces/new?hide_repo_select=true&ref=main&repo=84222786&machine=basicLinux32gb&devcontainer_path=.devcontainer.json&location=WestEurope)); and learning to solve problems on small datasets is a prerequisite to solving them on huge ones.
+Analysis of 'big data' often involves extracting a small amount of data from a database for a specific statistical analysis.
+Spatial databases\index{spatial database}, covered in Chapter \@ref(gis), can help with the analysis of datasets that do not fit in memory, as the first sketch below shows.
+'Earth observation cloud back-ends' can be accessed from R with the **openeo** package, as described on the [openeo.org](https://openeo.org/) website and shown in the second sketch below.
+We omitted detailed coverage of geographic analysis of big data with systems such as [Apache Sedona](https://sedona.apache.org/) because the hardware and time costs of setting up such systems are high relative to their niche use cases.
+<!-- TODO: add reference on big data?-->
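
A minimal sketch of such a database-driven workflow, assuming a running PostGIS instance (the connection details, table and column names below are hypothetical):

```r
# Minimal sketch: let PostGIS do the heavy lifting and return only a
# small result set to R. The 'highways' table and credentials are
# hypothetical.
library(DBI)
library(sf)
conn = dbConnect(RPostgres::Postgres(), dbname = "geodata",
                 host = "localhost", user = "user", password = "pass")
# The spatial filter runs in the database, not in the R session:
query = "SELECT name, geom FROM highways
  WHERE ST_DWithin(geom::geography, ST_MakePoint(-1.5, 53.8)::geography, 5000)"
small_subset = st_read(conn, query = query)
dbDisconnect(conn)
```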
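
And a minimal sketch of connecting to an Earth observation cloud back-end with **openeo** (the endpoint is one of several possible back-ends; processing typically requires registration and login):

```r
# Minimal sketch: discover what an openEO back-end offers.
library(openeo)
con = connect(host = "https://openeo.cloud")
collections = list_collections() # datasets hosted on the back-end
processes = list_processes()     # operations that run server-side
```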

## Getting help? {#questions}
