Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(document-extractor): split h4 sections #4609

Closed
wants to merge 51 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
51 commits
Select commit Hold shift + click to select a range
8422ac5
Add one more short circuit AND now that we split by h4 tags as well
tannerdolby Aug 2, 2021
45777ee
Add logic to document-extractor.js to split section by h2, h3 OR h4
tannerdolby Aug 2, 2021
7dfc273
Introduce isH4 and DisplayH4 to spec-section
tannerdolby Aug 2, 2021
3f97e7b
Introduce isH4 and DisplayH4 to lazy-bcd-table
tannerdolby Aug 2, 2021
6dbbb3d
Add logic to conditionally render DisplayH4 component
tannerdolby Aug 2, 2021
762ffad
Create DisplayH4 functional component in utils.tsx
tannerdolby Aug 2, 2021
bb79fd9
Add h4 selector to make sure they receive the same styles as other di…
tannerdolby Aug 2, 2021
530e09f
Refactor child.tagName conditional check
tannerdolby Aug 3, 2021
e192ecd
Use a set for the tagNames instead
tannerdolby Aug 3, 2021
f407799
Condense each DisplayH* function into a single DisplayHeading component
tannerdolby Aug 4, 2021
2135fc3
Render a single DisplayHeading component based on its level
tannerdolby Aug 4, 2021
745b975
Replace DisplayH2,H3,H4 component calls with a single DisplayHeading
tannerdolby Aug 4, 2021
51fb4e0
Merge branch 'main' of github.com:mdn/yari into 4370-split-section-by-h4
tannerdolby Aug 4, 2021
c1f4cdf
Comment out flaws.push for excess <h4> tags not at root-level
tannerdolby Aug 4, 2021
de93799
Merge branch 'main' of github.com:mdn/yari into 4370-split-section-by-h4
tannerdolby Aug 20, 2021
c8c0168
Create template for test fixture
tannerdolby Aug 20, 2021
f0d5dd1
Add jest tests
tannerdolby Aug 20, 2021
0054977
Merge branch 'main' of github.com:mdn/yari into 4370-split-section-by-h4
tannerdolby Sep 3, 2021
a754410
Merge branch 'main' of github.com:mdn/yari into 4370-split-section-by-h4
tannerdolby Nov 8, 2021
a079000
Merge branch 'main' into 4370-split-section-by-h4
tannerdolby May 14, 2022
bd35828
Merge branch 'main' of github.com:mdn/yari into 4370-split-section-by…
tannerdolby Jun 22, 2022
7306e90
Remove unused 'DisplayHeading' reference for now
tannerdolby Jun 22, 2022
436fbf2
Accept incoming changes from merge in spec-section.tsx
tannerdolby Jun 22, 2022
e29a4c9
Replace Set.has() with RegExp.test() for performance
tannerdolby Jun 22, 2022
044a454
Remove unused DisplayH2/H3 references
tannerdolby Jun 22, 2022
986af03
Use the DisplayHeading component
tannerdolby Jun 22, 2022
437eba8
Access isH4 and isH3 from the section
tannerdolby Jun 22, 2022
d0504d6
Merge branch 'main' into 4370-split-section-by-h4
Jun 25, 2022
3f7752b
Merge branch 'main' of github.com:mdn/yari into 4370-split-section-by-h4
tannerdolby Sep 4, 2022
34555a5
Merge branch '4370-split-section-by-h4' of github.com:tannerdolby/yar…
tannerdolby Sep 4, 2022
0561206
Refactor section splitting to reduce duplication
tannerdolby Sep 4, 2022
f249a3d
Add missing encoding parameter to fs.readFileSync
tannerdolby Sep 4, 2022
73476be
Add missing isH4 property to Section types
tannerdolby Sep 4, 2022
a294686
Add isH4 to the ProseSection value to match updated type definition
tannerdolby Sep 4, 2022
032b950
Remove .html() after extracting Cheerio element to variable
tannerdolby Sep 4, 2022
ccd7ffb
Experimenting with link matching for section splitting tests
tannerdolby Sep 4, 2022
a4efc65
Debugging getting the innerHTML from a specific h2/h3/h4 element
tannerdolby Sep 4, 2022
0a051fc
Add closure function to reduce duplication in heading extraction
tannerdolby Sep 4, 2022
08425eb
Remove stale function definition for extractHeadings
tannerdolby Sep 4, 2022
375e6cc
Debugging section splitting tests
tannerdolby Sep 4, 2022
c9af789
Testing section splitting
tannerdolby Sep 4, 2022
ddb074b
Tinkering with helper function
tannerdolby Sep 4, 2022
0d41c0d
Debugging
tannerdolby Sep 4, 2022
1107894
Fallback to original extraction logic
tannerdolby Sep 4, 2022
9fece47
Revert to original to debug
tannerdolby Sep 4, 2022
7a839c0
Replicate existing extraction logic for h4s
tannerdolby Sep 5, 2022
f89d8b1
Debugging
tannerdolby Sep 5, 2022
86c61b7
Testing
tannerdolby Sep 5, 2022
94f6e6e
Testing extraction behavior
tannerdolby Sep 5, 2022
4e50ec9
elements are getting removed, trying understand the flow better
tannerdolby Sep 5, 2022
1a11752
Comment out the h4 extraction for now
tannerdolby Sep 5, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
56 changes: 47 additions & 9 deletions build/document-extractor.ts
Original file line number Diff line number Diff line change
Expand Up @@ -59,6 +59,7 @@ export function extractSections($: cheerio.CheerioAPI): [Section[], string[]] {
const iterable = [...(body.childNodes as cheerio.Element[])];

let c = 0;

iterable.forEach((child) => {
if (
(child as cheerio.Element).tagName === "h2" ||
Expand Down Expand Up @@ -303,11 +304,11 @@ function addSections($: cheerio.Cheerio<cheerio.Element>): SectionsAndFlaws {
const specialSections = _addSingleSpecialSection($);

// The _addSingleSpecialSection() function will have sucked up the <h2> or <h3>
// and the `div.bc-data` or `div.bc-specs` to turn it into a special section.
// or <h4> and the `div.bc-data` or `div.bc-specs` to turn it into a special section.
// First remove that, then put whatever HTML is left as a prose
// section underneath.
$.find("div.bc-data, h2, h3").remove();
$.find("div.bc-specs, h2, h3").remove();
$.find("div.bc-data, h2, h3, h4").remove();
$.find("div.bc-specs, h2, h3, h4").remove();
const [proseSections, proseFlaws] = _addSectionProse($);
specialSections.push(...proseSections);
flaws.push(...proseFlaws);
Expand All @@ -330,18 +331,23 @@ function _addSingleSpecialSection(
let id: string | null = null;
let title: string | null = null;
let isH3 = false;
let isH4 = false;

const h2s = $.find("h2");
const h3s = $.find("h3");
if (h2s.length === 1) {
id = h2s.attr("id");
title = h2s.text();
} else if (h3s.length === 1) {
id = h3s.attr("id");
title = h3s.text();
isH3 = true;
} else {
const h3s = $.find("h3");
if (h3s.length === 1) {
id = h3s.attr("id");
title = h3s.text();
isH3 = true;
}
// Look for <h4>s
const h4s = $.find("h4");
id = h4s.attr("id");
title = h4s.text();
isH4 = true;
}

let dataQuery: string = "";
Expand Down Expand Up @@ -381,6 +387,7 @@ function _addSingleSpecialSection(
title,
id,
isH3,
isH4,
data: null,
query,
browsers: null,
Expand All @@ -398,6 +405,7 @@ function _addSingleSpecialSection(
title,
id,
isH3,
isH4,
query,
specifications: [],
},
Expand Down Expand Up @@ -486,6 +494,7 @@ function _addSingleSpecialSection(
title,
id,
isH3,
isH4,
data,
query,
browsers,
Expand Down Expand Up @@ -666,6 +675,7 @@ function _addSingleSpecialSection(
title,
id,
isH3,
isH4,
specifications,
query,
},
Expand All @@ -681,13 +691,16 @@ function _addSectionProse(
let title: string | null = null;
let titleAsText: string = "";
let isH3 = false;
let isH4 = false;

const flaws: string[] = [];

// The way this works...
// Given a section of HTML, try to extract a id, title,

let h2found = false;
let h3found = false;

const h2s = $.find("h2");
h2s.each((i) => {
const h2 = h2s.eq(i);
Expand Down Expand Up @@ -730,9 +743,33 @@ function _addSectionProse(
h3.remove();
}
}
h3found = true;
});
}

// TODO: h4 elements are not being linkified and the
// tests are just picking up the initial text instead of the <a> tag
// const h4s = $.find("h4");
// h4s.each((i) => {
// const h4 = h4s.eq(i);
// if (i) {
// // Excess!
// flaws.push(
// `Excess <h4> tag that is NOT at root-level (id='${h4.attr(
// "id"
// )}', text='${h4.text()}')`
// );
// } else {
// id = h4.attr("id") ?? "";
// title = h4.html() ?? "";
// titleAsText = h4.text();
// if (id && title) {
// isH4 = true;
// h4.remove();
// }
// }
// });

if (id) {
// Remove trailing underscores (https://github.com/mdn/yari/issues/5492).
id = id.replace(/_+$/g, "");
Expand All @@ -742,6 +779,7 @@ function _addSectionProse(
id,
title,
isH3,
isH4,
content: $.html()?.trim(),
};

Expand Down
3 changes: 2 additions & 1 deletion build/index.ts
Original file line number Diff line number Diff line change
Expand Up @@ -253,7 +253,8 @@ function makeTOC(doc) {
section.type === "specifications") &&
section.value.id &&
section.value.title &&
!section.value.isH3
!section.value.isH3 &&
!section.value.isH4
) {
return { text: section.value.title, id: section.value.id };
}
Expand Down
7 changes: 3 additions & 4 deletions client/src/document/ingredients/prose.tsx
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
import { DisplayH2, DisplayH3 } from "./utils";
import { DisplayHeading } from "./utils";
tannerdolby marked this conversation as resolved.
Show resolved Hide resolved

export function Prose({ section }: { section: any }) {
const { id } = section;
Expand All @@ -14,11 +14,10 @@ export function Prose({ section }: { section: any }) {
return <Content />;
}

const DisplayHx = section.isH3 ? DisplayH3 : DisplayH2;

return (
<section aria-labelledby={id}>
<DisplayHx
<DisplayHeading
level={section.isH4 ? 4 : section.isH3 ? 3 : 2}
id={id}
title={section.title}
titleAsText={section.titleAsText}
Expand Down
7 changes: 4 additions & 3 deletions client/src/document/ingredients/spec-section.tsx
Original file line number Diff line number Diff line change
@@ -1,16 +1,18 @@
import { DisplayH2, DisplayH3 } from "./utils";
import { DisplayHeading } from "./utils";
import NoteCard from "../../ui/molecules/notecards";

export function SpecificationSection({
id,
title,
isH3,
isH4,
specifications,
query,
}: {
id: string;
title: string;
isH3: boolean;
isH4: boolean;
specifications: Array<{
title: string;
bcdSpecificationURL: string;
Expand All @@ -19,8 +21,7 @@ export function SpecificationSection({
}) {
return (
<>
{title && !isH3 && <DisplayH2 id={id} title={title} />}
{title && isH3 && <DisplayH3 id={id} title={title} />}
{<DisplayHeading level={isH4 ? 4 : isH3 ? 3 : 2} id={id} title={title} />}

{specifications.length > 0 ? (
<table className="standard-table">
Expand Down
27 changes: 7 additions & 20 deletions client/src/document/ingredients/utils.tsx
Original file line number Diff line number Diff line change
@@ -1,32 +1,19 @@
export function DisplayH2({
export function DisplayHeading({
level,
id,
title,
titleAsText,
}: {
level: number;
id: string;
title: string;
titleAsText?: string;
}) {
const Tag = `h${level}` as keyof JSX.IntrinsicElements;
return (
<h2 id={id.toLowerCase()}>
<Permalink title={title} titleAsText={titleAsText} id={id} />
</h2>
);
}

export function DisplayH3({
id,
title,
titleAsText,
}: {
id: string;
title: string;
titleAsText?: string;
}) {
return (
<h3 id={id.toLowerCase()}>
<Permalink title={title} titleAsText={titleAsText} id={id} />
</h3>
<Tag id={id.toLowerCase()}>
<Permalink id={id} title={title} titleAsText={titleAsText} />
</Tag>
);
}

Expand Down
8 changes: 5 additions & 3 deletions client/src/document/lazy-bcd-table.tsx
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
import React, { lazy, Suspense, useEffect, useState } from "react";
import useSWR from "swr";

import { DisplayH2, DisplayH3 } from "./ingredients/utils";
import { DisplayHeading } from "./ingredients/utils";
import { Loading } from "../ui/atoms/loading";
// Because it's bad for web performance to lazy-load CSS during the initial render
// (because the page is saying "Wait! Stop rendering, now that I've downloaded
Expand All @@ -25,19 +25,21 @@ export function LazyBrowserCompatibilityTable({
id,
title,
isH3,
isH4,
query,
dataURL,
}: {
id: string;
title: string;
isH3: boolean;
isH4: boolean;
query: string;
dataURL: string | null;
}) {
return (
<>
{title && !isH3 && <DisplayH2 id={id} title={title} />}
{title && isH3 && <DisplayH3 id={id} title={title} />}
{<DisplayHeading level={isH4 ? 4 : isH3 ? 3 : 2} id={id} title={title} />}

{dataURL ? (
<LazyBrowserCompatibilityTableInner dataURL={dataURL} />
) : (
Expand Down
3 changes: 3 additions & 0 deletions libs/types/document.ts
Original file line number Diff line number Diff line change
Expand Up @@ -165,6 +165,7 @@ export interface ProseSection {
id: string | null;
title: string | null;
isH3: boolean;
isH4: boolean;
content?: string;
titleAsText?: string;
};
Expand All @@ -175,6 +176,7 @@ export interface SpecificationsSection {
id: string;
title: string;
isH3: boolean;
isH4: boolean;
query: string;
specifications: {
bcdSpecificationURL: any;
Expand All @@ -189,6 +191,7 @@ export interface BCDSection {
id: string;
title: string;
isH3: boolean;
isH4: boolean;
data?: BCD.Identifier | null;
dataURL?: string;
query: string;
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
---
title: Split section by heading
slug: Web/Split_section_by_heading
---

<p>This page contains a few h2, h3 and h4 tags at root-level that will be direct linkified.</p>


<h2>Some heading</h2>
<h4>Foo buzz</h4>
<h4 id="foo">Foo bar</h4>
<h3>Another heading</h3>
<div><p>stuff</p></div>
<h4>Final heading</h4>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be interesting to add a duplicate h4, i.e. a heading that also exists in another section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, it would be a good scenario to test

40 changes: 40 additions & 0 deletions testing/tests/index.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -1767,6 +1767,46 @@ test("notecards are correctly transformed by the formatNotecards utility", () =>
);
});

test("sections should be split by h2, h3 or h4", () => {
const builtFolder = path.join(
buildRoot,
"en-us",
"docs",
"web",
"split_section_by_heading"
);

const jsonFile = path.join(builtFolder, "index.json");
const { doc } = JSON.parse(fs.readFileSync(jsonFile, "utf-8"));
expect(doc.flaws.length).toBeFalsy();
expect(doc.title).toBe("Split section by heading");

const htmlFile = path.join(builtFolder, "index.html");
const html = fs.readFileSync(htmlFile, "utf-8");
const $ = cheerio.load(html);

const h2 = $("h2#some_heading");
expect(h2.text()).toBe("Some heading");
expect(h2.attr("id")).toBe("some_heading");
expect(h2.html()).toBe(
'<a href="#some_heading" title="Permalink to Some heading">Some heading</a>'
);

const h3 = $("h3#another_heading");
expect(h3.text()).toBe("Another heading");
expect(h3.attr("id")).toBe("another_heading");
expect(h3.html()).toBe(
'<a href="#another_heading" title="Permalink to Another heading">Another heading</a>'
);

const h4 = $("h4#final_heading");
expect(h4.text()).toBe("Final heading");
expect(h4.attr("id")).toBe("final_heading");
expect(h4.html()).toBe(
'<a href="#final_heading" title="Permalink to Final heading">Final heading</a>'
);
});

test("homepage links and flaws", () => {
const builtFolder = path.join(
buildRoot,
Expand Down