Shorten and improve sentences mentioning "big data" (#911)
Robinlovelace committed Jan 26, 2023
1 parent 3013dc5 commit 97edb68
Showing 2 changed files with 7 additions and 11 deletions.
2 changes: 1 addition & 1 deletion 10-gis.Rmd
@@ -858,7 +858,7 @@ But the same is true for the lightweight SQLite/SpatiaLite database engine and G

If your datasets are too big for PostgreSQL/PostGIS and you require massive spatial data management and query performance, it may be worth exploring large-scale geographic querying on distributed computing systems.
Such systems are outside the scope of this book, but it is worth mentioning that open source software providing this functionality exists.
-Prominent projects in this space include [GeoMesa](http://www.geomesa.org/) and [Apache Sedona](https://sedona.apache.org/), formerly known as GeoSpark [@huang_geospark_2017], which has and R interface provided by the [**apache.sedona**](https://cran.r-project.org/package=apache.sedona) package.
+Prominent projects in this space include [GeoMesa](http://www.geomesa.org/) and [Apache Sedona](https://sedona.apache.org/). The [**apache.sedona**](https://cran.r-project.org/package=apache.sedona) package provides an interface to the latter.
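
A minimal sketch of how a Sedona-backed query can be run from R, assuming a local Spark installation alongside the **sparklyr** and **apache.sedona** packages (the point query is purely illustrative):

```r
# Minimal sketch: run spatial SQL on Spark via Apache Sedona.
# Assumes sparklyr and apache.sedona are installed and a local
# Spark installation exists; master = "local" is for testing only.
library(sparklyr)
library(apache.sedona) # makes Sedona's ST_* functions available to Spark
sc = spark_connect(master = "local")
# The query executes in the Spark engine, not in the R session:
pts = sdf_sql(sc, "SELECT ST_Point(1.0, 2.0) AS geom")
pts
spark_disconnect(sc)
```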

## Bridges to cloud technologies and services {#cloud}

16 changes: 6 additions & 10 deletions 16-synthesis.Rmd
@@ -127,16 +127,12 @@ Instead of covering spatial statistical modeling and inference techniques, we fo
Again, the reason was that there are already excellent resources on these topics, especially with ecological use cases, including @zuur_mixed_2009 and @zuur_beginners_2017, the freely available teaching material and code on *Geostatistics & Open-source Statistical Computing* by David Rossiter, hosted at [css.cornell.edu/faculty/dgr2](http://www.css.cornell.edu/faculty/dgr2/teach/), and the [*R for Geographic Data Science*](https://sdesabbata.github.io/r-for-geographic-data-science/) project by [Stefano De Sabbata](https://sdesabbata.github.io/) [at the University of Leicester](https://le.ac.uk/people/stefano-de-sabbata), which provides an introduction to R\index{R} for geographic data science\index{data science}.
There are also excellent resources on spatial statistics\index{spatial!statistics} using Bayesian modeling, a powerful framework for modeling and uncertainty estimation [@blangiardo_spatial_2015;@krainski_advanced_2018].

-Finally, we have largely omitted big data\index{big data} analytics.
-This might seem surprising since especially geographic data can become big really fast.
-But the prerequisite for doing big data analytics is to know how to solve a problem on a small dataset.
-Once you have learned that, you can apply the exact same techniques on big data questions, though of course you need to expand your toolbox.
-The first thing to learn is to handle geographic data queries.
-This is because big data\index{big data} analytics often boil down to extracting a small amount of data from a database for a specific statistical analysis.
-For this, we have provided an introduction to spatial databases\index{spatial database} and how to use a GIS\index{GIS} from within R in Chapter \@ref(gis).
-If you really have to do the analysis on a big or even the complete dataset, hopefully, the problem you are trying to solve is embarrassingly parallel.
-For this, you need to learn a system that is able to do this parallelization efficiently such as [Apache Sedona](https://sedona.apache.org/), as mentioned in Section \@ref(postgis).
-Regardless of dataset size, the techniques and concepts you have used on small datasets will be useful\index{big data} question, the only difference being the extra considterations when working in a big data setting.
+We have largely omitted geocomputation on 'big data'\index{big data}, by which we mean datasets that do not fit on consumer hardware or which cannot realistically be processed on a single CPU.
+This decision is justified by two facts: the majority of geographic datasets needed for common research or policy applications *do* fit on consumer hardware, even if that may mean increasing the amount of RAM on your computer (or temporarily 'renting' compute power on platforms such as [GitHub Codespaces](https://github.com/codespaces/new?hide_repo_select=true&ref=main&repo=84222786&machine=basicLinux32gb&devcontainer_path=.devcontainer.json&location=WestEurope)); and learning to solve problems on small datasets is a prerequisite to solving them on huge ones.
+Analysis of 'big data' often involves extracting a small amount of data from a database for a specific statistical analysis.
+Spatial databases\index{spatial database}, covered in Chapter \@ref(gis), can help with the analysis of datasets that do not fit in memory, as the first sketch below shows.
+'Earth observation cloud back-ends' can be accessed from R with the **openeo** package, as described on the [openeo.org](https://openeo.org/) website and shown in the second sketch below.
+We omitted detailed coverage of geographic analysis of big data with systems such as [Apache Sedona](https://sedona.apache.org/) because the hardware and time costs of setting up such systems are high relative to their niche use cases.
+<!-- TODO: add reference on big data?-->
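
A minimal sketch of such a database-driven workflow, assuming a running PostGIS instance (the connection details, table and column names below are hypothetical):

```r
# Minimal sketch: let PostGIS do the heavy lifting and return only a
# small result set to R. The 'highways' table and credentials are
# hypothetical.
library(DBI)
library(sf)
conn = dbConnect(RPostgres::Postgres(), dbname = "geodata",
                 host = "localhost", user = "user", password = "pass")
# The spatial filter runs in the database, not in the R session:
query = "SELECT name, geom FROM highways
  WHERE ST_DWithin(geom::geography, ST_MakePoint(-1.5, 53.8)::geography, 5000)"
small_subset = st_read(conn, query = query)
dbDisconnect(conn)
```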
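
And a minimal sketch of connecting to an Earth observation cloud back-end with **openeo** (the endpoint is one of several possible back-ends; processing typically requires registration and login):

```r
# Minimal sketch: discover what an openEO back-end offers.
library(openeo)
con = connect(host = "https://openeo.cloud")
collections = list_collections() # datasets hosted on the back-end
processes = list_processes()     # operations that run server-side
```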

## Getting help? {#questions}
