Commit

8524 adding mechanism for storing tab. files with variable headers (#10282)

* "stored with header" flag #8524

* more changes for the streaming and redirect code. #8524

* disabling dynamically-generated varheader in the remaining storage drivers. #8524

* Ingest plugins (work in progress) #8524

* R ingest plugin (#8524)

* still some unaddressed @todo:s, but the branch should build and the unit tests should be passing. # 8524

* work-in-progress, on the subsetting code in the download instance writer. #8524

* more work-in-progress changes. removing all the unused code from TabularSubsetGenerator, for clarity etc. #8524

* more bits and pieces #8524

* 2 more ingest plugins. #8542

* Integration tests. #8524

* typo #8524

* documenting the new setting. #8524

* a release note for the pr. also, added the "storage quotas enabled" to the list of settings documented in the config guide while I was at it. #8524

* removed all the unused code from this class (lots of it) for clarity, etc. git history can be consulted if anyone is curious about what we used to do here. #8524

* removing @todo: that's no longer relevant #8524

* (cosmetic) defined the control constants used in the integration test. #8524
landreev authored Feb 7, 2024
1 parent d944773 commit bec3945
Showing 31 changed files with 501 additions and 1,335 deletions.
@@ -0,0 +1,6 @@
Tabular Data Ingest can now save the generated archival files with the list of variable names added as the first tab-delimited line.
The most significant effect of this feature is that the Access API can use Direct Download for tabular files saved with these headers on S3, since the header line no longer has to be generated and added to the streamed content on the fly.

This behavior is controlled by the new setting `:StoreIngestedTabularFilesWithVarHeaders`. It is false by default, preserving the legacy behavior. When enabled, Dataverse handles both newly ingested files and already-existing legacy files stored without these headers, transparently to the user. For example, the Access API will continue delivering tab-delimited files **with** this header line, either adding it dynamically for legacy files or reading complete files directly from storage for those stored with it.

An API for converting existing legacy tabular files will be added separately. [this line will need to be changed if we have time to add said API before 6.2 is released].
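
The transparency described in this release note can be illustrated with a minimal Java sketch. It is not the Dataverse implementation: the class and method names below are hypothetical, and file content is modeled as an in-memory string purely for illustration. It shows that a client receives the same bytes whether the header is prepended on the fly (legacy file) or already present in storage.

```java
import java.util.List;

// Hypothetical illustration; none of these names exist in Dataverse.
final class VarHeaderDeliverySketch {

    /** Builds the tab-delimited header line from the variable names. */
    static String headerLine(List<String> variableNames) {
        return String.join("\t", variableNames) + "\n";
    }

    /**
     * Returns the content a client receives for a full download.
     * Legacy files get the header prepended on the fly; files stored
     * with the header are passed through unchanged.
     */
    static String deliver(String storedContent,
                          boolean storedWithVariableHeader,
                          List<String> variableNames) {
        if (storedWithVariableHeader) {
            return storedContent;
        }
        return headerLine(variableNames) + storedContent;
    }

    public static void main(String[] args) {
        List<String> names = List.of("id", "age");
        String legacy = "1\t34\n2\t51\n";
        String withHeader = "id\tage\n1\t34\n2\t51\n";
        // Both calls produce identical output.
        System.out.print(deliver(legacy, false, names));
        System.out.print(deliver(withHeader, true, names));
    }
}
```

Because the pass-through branch needs no processing, a file stored with its header is one that a storage service such as S3 could serve directly.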
22 changes: 22 additions & 0 deletions doc/sphinx-guides/source/installation/config.rst
@@ -4151,3 +4151,25 @@ A true/false (default) option determining whether the dataset datafile table dis

.. _supported MicroProfile Config API source: https://docs.payara.fish/community/docs/Technical%20Documentation/MicroProfile/Config/Overview.html


.. _:UseStorageQuotas:

:UseStorageQuotas
+++++++++++++++++

Enables storage use quotas in collections. See the :doc:`/api/native-api` for details.


.. _:StoreIngestedTabularFilesWithVarHeaders:

:StoreIngestedTabularFilesWithVarHeaders
++++++++++++++++++++++++++++++++++++++++

With this setting enabled, tabular files produced during Ingest will
be stored with the list of variable names added as the first
tab-delimited line. The most significant effect of this feature is
that the Access API can use Direct Download for tabular files saved
with these headers on S3, since the header line no longer has to be
generated and added to the streamed file on the fly.

The setting is ``false`` by default, preserving the legacy behavior.
18 changes: 18 additions & 0 deletions src/main/java/edu/harvard/iq/dataverse/DataTable.java
@@ -112,6 +112,16 @@ public DataTable() {
@Column( nullable = true )
private String originalFileName;


/**
* The physical tab-delimited file is in storage with the list of variable
* names saved as the 1st line. This means that we do not need to generate
* this line on the fly. (Also means that direct download mechanism can be
* used for this file!)
*/
@Column(nullable = false)
private boolean storedWithVariableHeader = false;

/*
* Getter and Setter methods:
*/
@@ -206,6 +216,14 @@ public void setOriginalFileName(String originalFileName) {
this.originalFileName = originalFileName;
}

public boolean isStoredWithVariableHeader() {
return storedWithVariableHeader;
}

public void setStoredWithVariableHeader(boolean storedWithVariableHeader) {
this.storedWithVariableHeader = storedWithVariableHeader;
}

/*
* Custom overrides for hashCode(), equals() and toString() methods:
*/
@@ -22,7 +22,6 @@
import jakarta.ws.rs.ext.Provider;

import edu.harvard.iq.dataverse.DataFile;
import edu.harvard.iq.dataverse.FileMetadata;
import edu.harvard.iq.dataverse.dataaccess.*;
import edu.harvard.iq.dataverse.datavariable.DataVariable;
import edu.harvard.iq.dataverse.engine.command.Command;
@@ -104,8 +103,10 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
String auxiliaryTag = null;
String auxiliaryType = null;
String auxiliaryFileName = null;

// Before we do anything else, check if this download can be handled
// by a redirect to remote storage (only supported on S3, as of 5.4):

if (storageIO.downloadRedirectEnabled()) {

// Even if the above is true, there are a few cases where a
@@ -159,7 +160,7 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
}

} else if (dataFile.isTabularData()) {
// Many separate special cases here.
// Many separate special cases here.

if (di.getConversionParam() != null) {
if (di.getConversionParam().equals("format")) {
@@ -180,12 +181,26 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
redirectSupported = false;
}
}
} else if (!di.getConversionParam().equals("noVarHeader")) {
// This is a subset request - can't do.
} else if (di.getConversionParam().equals("noVarHeader")) {
// This will work just fine, if the tab. file is
// stored without the var. header. Throw "unavailable"
// exception otherwise.
// @todo: should we actually drop support for this "noVarHeader" flag?
if (dataFile.getDataTable().isStoredWithVariableHeader()) {
throw new ServiceUnavailableException();
}
// ... defaults to redirectSupported = true
} else {
// This must be a subset request then - can't do.
redirectSupported = false;
}
} else {
// "straight" download of the full tab-delimited file.
// can redirect, but only if stored with the variable
// header already added:
if (!dataFile.getDataTable().isStoredWithVariableHeader()) {
redirectSupported = false;
}
} else {
redirectSupported = false;
}
}
}
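
The redirect rules for tabular files in the hunk above can be summarized as a small decision table. The following Java sketch is a hedged reading of that hunk, not code from the commit: the enum and method names are hypothetical, it assumes the storage driver has download redirects enabled, and it leaves out the "format" conversion case because the diff elides part of it.

```java
// Hypothetical names throughout; this is not Dataverse code.
final class RedirectDecisionSketch {

    /** The kinds of tabular download requests visible in the hunk. */
    enum Request { FULL_FILE, NO_VAR_HEADER, SUBSET }

    /** What the writer does with the request. */
    enum Outcome { REDIRECT, STREAM_LOCALLY, UNAVAILABLE }

    static Outcome decide(Request request, boolean storedWithVariableHeader) {
        switch (request) {
            case FULL_FILE:
                // Redirect only if the header is already part of the stored file.
                return storedWithVariableHeader ? Outcome.REDIRECT : Outcome.STREAM_LOCALLY;
            case NO_VAR_HEADER:
                // A headerless copy cannot be produced from a file stored with the header.
                return storedWithVariableHeader ? Outcome.UNAVAILABLE : Outcome.REDIRECT;
            case SUBSET:
                // Subsetting always requires processing the file.
                return Outcome.STREAM_LOCALLY;
            default:
                throw new IllegalArgumentException("unknown request: " + request);
        }
    }

    public static void main(String[] args) {
        for (Request request : Request.values()) {
            for (boolean stored : new boolean[] {false, true}) {
                System.out.println(request + ", storedWithVariableHeader=" + stored
                        + " -> " + decide(request, stored));
            }
        }
    }
}
```

Note that the full-file case flips relative to the legacy behavior: before this change a plain download of a tabular file could never be redirected, since the header always had to be added while streaming.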
@@ -247,11 +262,16 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
// finally, issue the redirect:
Response response = Response.seeOther(redirect_uri).build();
logger.fine("Issuing redirect to the file location.");
// Yes, this throws an exception. It's not an exception
// as in, "bummer, something went wrong". This is how a
// redirect is produced here!
throw new RedirectionException(response);
}
throw new ServiceUnavailableException();
}

// Past this point, this is a locally served/streamed download

if (di.getConversionParam() != null) {
// Image Thumbnail and Tabular data conversion:
// NOTE: only supported on local files, as of 4.0.2!
@@ -285,9 +305,14 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
// request any tabular-specific services.

if (di.getConversionParam().equals("noVarHeader")) {
logger.fine("tabular data with no var header requested");
storageIO.setNoVarHeader(Boolean.TRUE);
storageIO.setVarHeader(null);
if (!dataFile.getDataTable().isStoredWithVariableHeader()) {
logger.fine("tabular data with no var header requested");
storageIO.setNoVarHeader(Boolean.TRUE);
storageIO.setVarHeader(null);
} else {
logger.fine("can't serve request for tabular data without varheader, since stored with it");
throw new ServiceUnavailableException();
}
} else if (di.getConversionParam().equals("format")) {
// Conversions, and downloads of "stored originals" are
// now supported on all DataFiles for which StorageIO
@@ -329,11 +354,10 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
if (variable.getDataTable().getDataFile().getId().equals(dataFile.getId())) {
logger.fine("adding variable id " + variable.getId() + " to the list.");
variablePositionIndex.add(variable.getFileOrder());
if (subsetVariableHeader == null) {
subsetVariableHeader = variable.getName();
} else {
subsetVariableHeader = subsetVariableHeader.concat("\t");
subsetVariableHeader = subsetVariableHeader.concat(variable.getName());
if (!dataFile.getDataTable().isStoredWithVariableHeader()) {
subsetVariableHeader = subsetVariableHeader == null
? variable.getName()
: subsetVariableHeader.concat("\t" + variable.getName());
}
} else {
logger.warning("variable does not belong to this data file.");
@@ -346,16 +370,29 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
try {
File tempSubsetFile = File.createTempFile("tempSubsetFile", ".tmp");
TabularSubsetGenerator tabularSubsetGenerator = new TabularSubsetGenerator();
tabularSubsetGenerator.subsetFile(storageIO.getInputStream(), tempSubsetFile.getAbsolutePath(), variablePositionIndex, dataFile.getDataTable().getCaseQuantity(), "\t");

long numberOfLines = dataFile.getDataTable().getCaseQuantity();
if (dataFile.getDataTable().isStoredWithVariableHeader()) {
numberOfLines++;
}

tabularSubsetGenerator.subsetFile(storageIO.getInputStream(),
tempSubsetFile.getAbsolutePath(),
variablePositionIndex,
numberOfLines,
"\t");

if (tempSubsetFile.exists()) {
FileInputStream subsetStream = new FileInputStream(tempSubsetFile);
long subsetSize = tempSubsetFile.length();

InputStreamIO subsetStreamIO = new InputStreamIO(subsetStream, subsetSize);
logger.fine("successfully created subset output stream.");
subsetVariableHeader = subsetVariableHeader.concat("\n");
subsetStreamIO.setVarHeader(subsetVariableHeader);

if (subsetVariableHeader != null) {
subsetVariableHeader = subsetVariableHeader.concat("\n");
subsetStreamIO.setVarHeader(subsetVariableHeader);
}

String tabularFileName = storageIO.getFileName();

@@ -380,8 +417,13 @@ public void writeTo(DownloadInstance di, Class<?> clazz, Type type, Annotation[]
} else {
logger.fine("empty list of extra arguments.");
}
// end of tab. data subset case
} else if (dataFile.getDataTable().isStoredWithVariableHeader()) {
logger.fine("tabular file stored with the var header included, no need to generate it on the fly");
storageIO.setNoVarHeader(Boolean.TRUE);
storageIO.setVarHeader(null);
}
}
} // end of tab. data file case

if (storageIO == null) {
//throw new WebApplicationException(Response.Status.SERVICE_UNAVAILABLE);
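
The subsetting changes in the hunks above rest on one idea: when the header is stored as the first line, the subsetter reads one extra line and the header is cut down to the requested columns like any data row, so no header needs to be built separately. A minimal Java sketch of that idea follows. It is not the `TabularSubsetGenerator` code; the class and method names are hypothetical, lines are held in memory for brevity, and column positions are assumed to be zero-based.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of subsetting a file stored with its header.
final class SubsetWithHeaderSketch {

    /** One extra line has to be read when the header is stored in the file. */
    static long linesToRead(long caseQuantity, boolean storedWithVariableHeader) {
        return storedWithVariableHeader ? caseQuantity + 1 : caseQuantity;
    }

    /** Keeps only the cells at the requested (zero-based, assumed) positions. */
    static String subsetLine(String line, List<Integer> positions) {
        String[] cells = line.split("\t", -1); // -1 keeps trailing empty cells
        List<String> kept = new ArrayList<>();
        for (int position : positions) {
            kept.add(cells[position]);
        }
        return String.join("\t", kept);
    }

    /** Subsets every line that has to be read, header line included if stored. */
    static List<String> subset(List<String> storedLines,
                               List<Integer> positions,
                               long caseQuantity,
                               boolean storedWithVariableHeader) {
        long limit = Math.min(linesToRead(caseQuantity, storedWithVariableHeader),
                storedLines.size());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < limit; i++) {
            result.add(subsetLine(storedLines.get(i), positions));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> stored = List.of("id\tage\tcity", "1\t34\tOslo", "2\t51\tLima");
        // Two cases plus the stored header line; columns 0 and 2 requested.
        for (String line : subset(stored, List.of(0, 2), 2, true)) {
            System.out.println(line);
        }
    }
}
```

For a legacy file the same routine reads only the data rows, which is why the diff still builds `subsetVariableHeader` from the variable names in that case.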
2 changes: 1 addition & 1 deletion src/main/java/edu/harvard/iq/dataverse/api/TestIngest.java
@@ -100,7 +100,7 @@ public String datafile(@QueryParam("fileName") String fileName, @QueryParam("fil
TabularDataIngest tabDataIngest = null;

try {
tabDataIngest = ingestPlugin.read(fileInputStream, null);
tabDataIngest = ingestPlugin.read(fileInputStream, false, null);
} catch (IOException ingestEx) {
output = output.concat("Caught an exception trying to ingest file " + fileName + ": " + ingestEx.getLocalizedMessage());
return output;
@@ -120,7 +120,8 @@ public void open (DataAccessOption... options) throws IOException {
&& dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData()
&& dataFile.getDataTable() != null
&& (!this.noVarHeader())) {
&& (!this.noVarHeader())
&& (!dataFile.getDataTable().isStoredWithVariableHeader())) {

List<DataVariable> datavariables = dataFile.getDataTable().getDataVariables();
String varHeaderLine = generateVariableHeader(datavariables);
@@ -450,8 +450,12 @@ public void open(DataAccessOption... options) throws IOException {
this.setSize(retrieveSizeFromMedia());
}
// Only applies for the S3 Connector case (where we could have run an ingest)
if (dataFile.getContentType() != null && dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData() && dataFile.getDataTable() != null && (!this.noVarHeader())) {
if (dataFile.getContentType() != null
&& dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData()
&& dataFile.getDataTable() != null
&& (!this.noVarHeader())
&& (!dataFile.getDataTable().isStoredWithVariableHeader())) {

List<DataVariable> datavariables = dataFile.getDataTable().getDataVariables();
String varHeaderLine = generateVariableHeader(datavariables);
@@ -124,8 +124,12 @@ public void open(DataAccessOption... options) throws IOException {
logger.fine("Setting size");
this.setSize(retrieveSizeFromMedia());
}
if (dataFile.getContentType() != null && dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData() && dataFile.getDataTable() != null && (!this.noVarHeader())) {
if (dataFile.getContentType() != null
&& dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData()
&& dataFile.getDataTable() != null
&& (!this.noVarHeader())
&& (!dataFile.getDataTable().isStoredWithVariableHeader())) {

List<DataVariable> datavariables = dataFile.getDataTable().getDataVariables();
String varHeaderLine = generateVariableHeader(datavariables);
@@ -225,7 +225,8 @@ public void open(DataAccessOption... options) throws IOException {
&& dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData()
&& dataFile.getDataTable() != null
&& (!this.noVarHeader())) {
&& (!this.noVarHeader())
&& (!dataFile.getDataTable().isStoredWithVariableHeader())) {

List<DataVariable> datavariables = dataFile.getDataTable().getDataVariables();
String varHeaderLine = generateVariableHeader(datavariables);
@@ -142,7 +142,8 @@ public void open(DataAccessOption... options) throws IOException {
&& dataFile.getContentType().equals("text/tab-separated-values")
&& dataFile.isTabularData()
&& dataFile.getDataTable() != null
&& (!this.noVarHeader())) {
&& (!this.noVarHeader())
&& (!dataFile.getDataTable().isStoredWithVariableHeader())) {

List<DataVariable> datavariables = dataFile.getDataTable().getDataVariables();
String varHeaderLine = generateVariableHeader(datavariables);
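
Each of the storage driver hunks above adds the same extra condition to the guard that decides whether a header line is generated when the file is opened. As a hedged sketch, that guard can be written as a single predicate. The class and method below are hypothetical and do not exist in the commit, which repeats the condition inline in every driver; the boolean parameters stand in for the `DataFile` and `DataTable` lookups.

```java
// Hypothetical predicate mirroring the guard repeated in the storage drivers.
final class HeaderOnOpenSketch {

    static final String TSV = "text/tab-separated-values";

    /**
     * True only when a header line still has to be generated on the fly:
     * an ingested tab-delimited file, header not suppressed by the caller,
     * and header not already stored as the first line of the file.
     */
    static boolean shouldGenerateHeader(String contentType,
                                        boolean tabularData,
                                        boolean hasDataTable,
                                        boolean noVarHeaderRequested,
                                        boolean storedWithVariableHeader) {
        return TSV.equals(contentType)      // also false for a null content type
                && tabularData
                && hasDataTable
                && !noVarHeaderRequested
                && !storedWithVariableHeader;
    }

    public static void main(String[] args) {
        // Legacy file: header must be generated.
        System.out.println(shouldGenerateHeader(TSV, true, true, false, false));
        // File stored with its header: nothing to generate.
        System.out.println(shouldGenerateHeader(TSV, true, true, false, true));
    }
}
```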