Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add mechanism for verifying that source code in documentation is valid #7956

Closed
wants to merge 28 commits into from
Closed
Show file tree
Hide file tree
Changes from 18 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -30,3 +30,4 @@ rust-version = "1.70"

[dependencies]
datafusion = { path = "../datafusion/core", version = "32.0.0", default-features = false }
tokio = { version = "^1.0", features = ["rt-multi-thread"] }
44 changes: 37 additions & 7 deletions docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,31 +19,38 @@

# DataFusion Documentation

This folder contains the source content of the [User Guide](./source/user-guide)
and [Contributor Guide](./source/contributor-guide). These are both published to
https://arrow.apache.org/datafusion/ as part of the release process.
This folder contains the source content of the following guides:

- [User Guide]
- [Library Guide]
- [Contributor Guide]

These guides are published to https://arrow.apache.org/datafusion/ as part of the release process.

## Dependencies

It's recommended to install build dependencies and build the documentation
inside a Python virtualenv.

- Python
- `pip install -r requirements.txt`
Install Python and then use pip to install dependencies:

```shell
pip install -r requirements.txt
```

## Build & Preview

Run the provided script to build the HTML pages.

```bash
```shell
./build.sh
```

The HTML will be generated into a `build` directory.

Preview the site on Linux by running this command.

```bash
```shell
firefox build/html/index.html
```

Expand All @@ -53,6 +60,25 @@ To make changes to the docs, simply make a Pull Request with your
proposed changes as normal. When the PR is merged the docs will be
automatically updated.

## Including Source Code

We want to make sure that all source code in the documentation is tested as part of the build and release process. We
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if you considered the opposite:

Leaving the code example inlined in the documentation, and extracting it out to datafusion-docs-test with a script.

This would have the benefit that the markdown would be closer to the actual output documentation, and it might be easier to work on the examples inline.

The downside is that then we would need to somehow run a script to copy the examples into datafusion-docs-tests as part of CI to make sure they were up to date

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This approach is more challenging because sometimes we are copying a whole function, and sometimes just a few lines of code, and some snippets depend on other snippets. It is also hard to write and test code with this approach because there is no IDE support.

achieve this by writing the code in standard Rust tests in the `datafusion-docs-test` crate, and annotate the code with
comments that mark the beginning and end of the portion of the code that we want to include in the documentation.

```rust
//begin:my_example
let foo = 1 + 1;
//end:my_example
```

We can now put an `include` directive in the markdown file, specifying the name of the Rust file containing the test
and the name of the example.

```md
<!-- include: my_rust_file::my_example -->
```

## Release Process

This documentation is hosted at https://arrow.apache.org/datafusion/
Expand All @@ -67,3 +93,7 @@ The Apache Software Foundation provides https://arrow.apache.org/,
which serves content based on the configuration in
[.asf.yaml](https://github.com/apache/arrow-datafusion/blob/main/.asf.yaml),
which specifies the target as https://arrow.apache.org/datafusion/.

[user guide]: ./source/user-guide
[library guide]: ./source/library-user-guide
[contributor guide]: ./source/contributor-guide
6 changes: 6 additions & 0 deletions docs/build.sh
Original file line number Diff line number Diff line change
Expand Up @@ -21,8 +21,14 @@
set -e
rm -rf build 2> /dev/null
rm -rf temp 2> /dev/null

# copy source to temp dir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also a build.bat file in this directory that might need to get updated

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I had missed that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a convenient way to test on Windows, unfortunately

mkdir temp
cp -rf source/* temp/

# copy markdown files into temp dir again and insert source from tests
python preprocess.py

# replace relative URLs with absolute URLs
sed -i -e 's/\.\.\/\.\.\/\.\.\//https:\/\/github.com\/apache\/arrow-datafusion\/blob\/main\//g' temp/contributor-guide/index.md
make SOURCEDIR=`pwd`/temp html
92 changes: 92 additions & 0 deletions docs/preprocess.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,92 @@
#!/usr/bin/python
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import glob
import os
import re


def copy_test_source(test_filename, test_method, output):
lines = []
with open(test_filename) as test:
in_example_code = False
for test_line in test.readlines():
if test_line.strip() == "//begin:{}".format(test_method):
in_example_code = True
continue
if test_line.strip() == "//end:{}".format(test_method):
break
if in_example_code:
lines.append(test_line)

# remove blank lines from the end of the list
while lines and lines[-1] == "":
lines.pop()

# remove leading indent when possible
consistent_indent = True
for line in lines:
if len(line.strip()) > 0 and not (
line.startswith(" ") or line.startswith("\t")
):
consistent_indent = False
break
if consistent_indent:
old_lines = lines
lines = []
for line in old_lines:
if len(line) >= 4:
lines.append(line[4:])
else:
lines.append(line)

# write to output
output.write("```rust\n")
for line in lines:
output.write(line)
output.write("```")


def add_source(input, output):
print("Copying", input, "to", output)
# <!-- include: library_logical_plan::plan_builder_1 -->
include_pattern = "<!-- include: ([a-z0-9_]*)::([a-z0-9_]*) -->"
with open(input, "r") as input:
with open(output, "w") as output:
for line in input.readlines():
matches = re.search(include_pattern, line)
if matches is not None:
test_file = matches.group(1)
test_method = matches.group(2)
test_filename = "src/{}.rs".format(test_file)
copy_test_source(test_filename, test_method, output)
else:
output.write(line)


def main():
for file in glob.glob("source/**/*.md"):
dest = "temp/" + file[7:]
last_path_sep = dest.rindex("/")
dir = dest[0:last_path_sep]
if not os.path.exists(dir):
os.makedirs(dir)
add_source(file, dest)


if __name__ == "__main__":
main()
54 changes: 5 additions & 49 deletions docs/source/library-user-guide/adding-udfs.md
Original file line number Diff line number Diff line change
Expand Up @@ -35,57 +35,21 @@ First we'll talk about adding an Scalar UDF end-to-end, then we'll talk about th

A Scalar UDF is a function that takes a row of data and returns a single value. For example, this function takes a single i64 and returns a single i64 with 1 added to it:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use datafusion::common::Result;

use datafusion::common::cast::as_int64_array;

pub fn add_one(args: &[ArrayRef]) -> Result<ArrayRef> {
// Error handling omitted for brevity

let i64s = as_int64_array(&args[0])?;

let new_array = i64s
.iter()
.map(|array_elem| array_elem.map(|value| value + 1))
.collect::<Int64Array>();

Ok(Arc::new(new_array))
}
```
<!-- include: library_udfs::add_one -->

For brevity, we'll skipped some error handling, but e.g. you may want to check that `args.len()` is the expected number of arguments.

This "works" in isolation, i.e. if you have a slice of `ArrayRef`s, you can call `add_one` and it will return a new `ArrayRef` with 1 added to each value.

```rust
let input = vec![Some(1), None, Some(3)];
let input = Arc::new(Int64Array::from(input)) as ArrayRef;

let result = add_one(&[input]).unwrap();
let result = result.as_any().downcast_ref::<Int64Array>().unwrap();

assert_eq!(result, &Int64Array::from(vec![Some(2), None, Some(4)]));
```
<!-- include: library_udfs::call_add_one -->

The challenge however is that DataFusion doesn't know about this function. We need to register it with DataFusion so that it can be used in the context of a query.

### Registering a Scalar UDF

To register a Scalar UDF, you need to wrap the function implementation in a `ScalarUDF` struct and then register it with the `SessionContext`. DataFusion provides the `create_udf` and `make_scalar_function` helper functions to make this easier.

```rust
let udf = create_udf(
"add_one",
vec![DataType::Int64],
Arc::new(DataType::Int64),
Volatility::Immutable,
make_scalar_function(add_one),
);
```
<!-- include: library_udfs::create_udf -->

A few things to note:

Expand All @@ -97,19 +61,11 @@ A few things to note:

That gives us a `ScalarUDF` that we can register with the `SessionContext`:

```rust
let mut ctx = SessionContext::new();

ctx.register_udf(udf);
```
<!-- include: library_udfs::register_udf -->

At this point, you can use the `add_one` function in your query:

```rust
let sql = "SELECT add_one(1)";

let df = ctx.sql(&sql).await.unwrap();
```
<!-- include: library_udfs::call_udf -->

## Adding a Window UDF

Expand Down
55 changes: 2 additions & 53 deletions docs/source/library-user-guide/building-logical-plans.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,35 +36,7 @@ much easier to use the [LogicalPlanBuilder], which is described in the next sect

Here is an example of building a logical plan directly:

<!-- source for this example is in datafusion_docs::library_logical_plan::plan_1 -->

```rust
// create a logical table source
let schema = Schema::new(vec![
Field::new("id", DataType::Int32, true),
Field::new("name", DataType::Utf8, true),
]);
let table_source = LogicalTableSource::new(SchemaRef::new(schema));

// create a TableScan plan
let projection = None; // optional projection
let filters = vec![]; // optional filters to push down
let fetch = None; // optional LIMIT
let table_scan = LogicalPlan::TableScan(TableScan::try_new(
"person",
Arc::new(table_source),
projection,
filters,
fetch,
)?);

// create a Filter plan that evaluates `id > 500` that wraps the TableScan
let filter_expr = col("id").gt(lit(500));
let plan = LogicalPlan::Filter(Filter::try_new(filter_expr, Arc::new(table_scan))?);

// print the plan
println!("{}", plan.display_indent_schema());
```
<!-- include: library_logical_plan::create_plan -->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested what happens if there is a bad link to an example here:

$ git diff
diff --git a/docs/source/library-user-guide/building-logical-plans.md b/docs/source/library-user-guide/building-logical-plans.md
index b75a788e83..6ae376d445 100644
--- a/docs/source/library-user-guide/building-logical-plans.md
+++ b/docs/source/library-user-guide/building-logical-plans.md
@@ -36,7 +36,7 @@ much easier to use the [LogicalPlanBuilder], which is described in the next sect

 Here is an example of building a logical plan directly:

-<!-- include: library_logical_plan::create_plan -->
+<!-- include: library_logical_plan::create_planXXXXX -->

 This example produces the following plan:

It seems to just leave a blank in the docs

Screenshot 2023-10-29 at 8 10 17 AM

Is there any way to generate an error in this case instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add some error checks.

Copy link
Contributor

@alamb alamb Nov 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried this again and now the script simply doesn't change the section, which I think is fine


This example produces the following plan:

Expand Down Expand Up @@ -99,30 +71,7 @@ Here are some examples of transformation methods, but for a full list, refer to

The following example demonstrates building the same simple query plan as the previous example, with a table scan followed by a filter.

<!-- source for this example is in datafusion_docs::library_logical_plan::plan_builder_1 -->

```rust
// create a logical table source
let schema = Schema::new(vec![
Field::new("id", DataType::Int32, true),
Field::new("name", DataType::Utf8, true),
]);
let table_source = LogicalTableSource::new(SchemaRef::new(schema));

// optional projection
let projection = None;

// create a LogicalPlanBuilder for a table scan
let builder = LogicalPlanBuilder::scan("person", Arc::new(table_source), projection)?;

// perform a filter operation and build the plan
let plan = builder
.filter(col("id").gt(lit(500)))? // WHERE id > 500
.build()?;

// print the plan
println!("{}", plan.display_indent_schema());
```
<!-- include: library_logical_plan::build_plan -->

This example produces the following plan:

Expand Down
3 changes: 1 addition & 2 deletions docs/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,5 @@
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

#[cfg(test)]
mod library_logical_plan;
mod library_udfs;
20 changes: 18 additions & 2 deletions docs/src/library_logical_plan.rs
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,8 @@ use datafusion::prelude::*;
use std::sync::Arc;

#[test]
fn plan_1() -> Result<()> {
fn create_plan() -> Result<()> {
//begin:create_plan
// create a logical table source
let schema = Schema::new(vec![
Field::new("id", DataType::Int32, true),
Expand All @@ -49,12 +50,20 @@ fn plan_1() -> Result<()> {

// print the plan
println!("{}", plan.display_indent_schema());
//end:create_plan

assert_eq!(
plan.display_indent_schema().to_string(),
r#"Filter: id > Int32(500) [id:Int32;N, name:Utf8;N]
TableScan: person [id:Int32;N, name:Utf8;N]"#
);

Ok(())
}

#[test]
fn plan_builder_1() -> Result<()> {
fn build_plan() -> Result<()> {
//begin:build_plan
// create a logical table source
let schema = Schema::new(vec![
Field::new("id", DataType::Int32, true),
Expand All @@ -73,6 +82,13 @@ fn plan_builder_1() -> Result<()> {

// print the plan
println!("{}", plan.display_indent_schema());
//end:build_plan

assert_eq!(
plan.display_indent_schema().to_string(),
r#"Filter: person.id > Int32(500) [id:Int32;N, name:Utf8;N]
TableScan: person [id:Int32;N, name:Utf8;N]"#
);

Ok(())
}
Loading