archive/tar: re-add sparse file support #22735

@dsnet

Hi @dsnet. Thanks for all the Go 1.10 archive/tar work. It's really an amazing amount of cleanup, and it's very well done.

The one change I'm uncomfortable with from an API point of view is the sparse hole support.

First, I worry that it's too complex to use. I get lost trying to read Example_sparseAutomatic - 99% of it seems to have nothing to do with sparse files - and I have a hard time believing that we expect clients to write all this code. Despite the name, nothing about the example strikes me as “automatic.”

Second, I worry that much of the functionality here does not belong in archive/tar. Tar files are not the only context in which a client might care about where the holes are in a file or about creating a new file with holes, and yet somehow this functionality is expressed in terms of tar.Header and a new tar.SparseEntry structure instead of tar-independent operations. Tar should especially not be importing and using such subtle bits of syscall as it is in sparse_windows.go.

It's too late to redesign this for Go 1.10, so I suggest we pull out this new API and revisit for Go 1.11.

For Go 1.11, I suggest investigating (1) what an appropriate API in package os would be, and (2) how to make archive/tar take advantage of that more automatically.

For example, perhaps it would make sense for package os to add

// Regions returns the boundaries of data and hole regions in the file.
// The result slice can be read as pairs of offsets indicating the location
// of initialized data in the file or, ignoring the first and last element,
// as pairs of offsets indicating the location of a hole in the file.
// The first element of the result is always 0, and the last element is
// always the size of the file.
// For example, if f is a 4-kilobyte file with data written only to the
// first and last kilobyte (and therefore a 2-kilobyte hole in the middle),
// Regions would return [0, 1024, 3072, 4096].
//
// On operating systems that do not support files with holes or do
// not support querying the location of holes in files,
// Regions returns [0, size].
//
// Regions may temporarily change the file offset, so it should not
// be executed in parallel with Read or Write operations.
func (f *File) Regions() ([]int64, error)

That would avoid archive/tar's current DetectSparseHoles and SparseEntry, and tar.Header would only need to add a new field Regions []int64. (Regions is not a great name; better names are welcome.) Note that using a simple slice of offsets avoids the need for a special invertSparseEntries function entirely: you just change whether you read pairs starting at index 0 or 1.
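To make the pairing concrete, here is a minimal sketch. The span type and the pairs helper are hypothetical names used only for illustration, not proposed API. It reads the example slice from the doc comment as data regions (starting at index 0) and as holes (starting at index 1).

```go
package main

import "fmt"

// span is a half-open byte range [Start, End) within a file.
// It is an illustrative type, not part of any proposed API.
type span struct{ Start, End int64 }

// pairs reads regions as consecutive (start, end) pairs beginning at
// index from. Starting at 0 yields the data regions; starting at 1
// yields the holes, because the first and last elements are skipped.
func pairs(regions []int64, from int) []span {
	var out []span
	for i := from; i+1 < len(regions); i += 2 {
		out = append(out, span{regions[i], regions[i+1]})
	}
	return out
}

func main() {
	// The 4-kilobyte example from the doc comment: data in the first
	// and last kilobyte, a 2-kilobyte hole in the middle.
	regions := []int64{0, 1024, 3072, 4096}
	fmt.Println("data: ", pairs(regions, 0))
	fmt.Println("holes:", pairs(regions, 1))
}
```

The same loop serves both views, which is why no separate inversion step is needed.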

As for "punching holes", it suffices on Unix (as you know) to simply truncate the file (which Create does anyway) and then not write to the holes. On Windows it appears to be necessary to set the file type to sparse, but I don't see why the rest of sparsePunchWindows is needed. It seems crazy to me that it could possibly be necessary to pre-declare every hole location in a fresh file. FSCTL_SET_ZERO_DATA looks like it is for making a hole in an existing file, not a new file. It seems like it should suffice to truncate the target file, mark it as sparse, set the file size, and then write the data. What's left should be automatically inferred as holes. If we were to add a new method SetSparse(bool) to os.File, then I would expect something like the following to work on all systems:

f = Create(file)
f.SetSparse(true) // no-op on non-Windows systems, FSCTL_SET_SPARSE (only) on Windows
for each data chunk {
	f.WriteAt(data, offset)
}
f.Truncate(targetSize) // in case of final hole, or write last byte of file
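For reference, the Unix half of that sequence can already be written with today's os package. The following is a rough sketch under a few assumptions: it omits the SetSparse call (which does not exist), the chunk type and the writeSparse and demo helpers are hypothetical names, and whether the unwritten range is actually stored as a hole depends on the operating system and file system. What it does show is that unwritten ranges read back as zeros and that Truncate fixes the final size.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
)

// chunk is a run of data to place at a given offset (illustrative type).
type chunk struct {
	Offset int64
	Data   []byte
}

// writeSparse creates name, writes only the data chunks, and then sets
// the final size. Ranges that are never written read back as zeros.
func writeSparse(name string, chunks []chunk, size int64) error {
	f, err := os.Create(name) // truncates any existing file
	if err != nil {
		return err
	}
	defer f.Close()
	for _, c := range chunks {
		if _, err := f.WriteAt(c.Data, c.Offset); err != nil {
			return err
		}
	}
	// Covers a trailing hole without writing the last byte.
	if err := f.Truncate(size); err != nil {
		return err
	}
	return f.Close()
}

// demo writes a 4 KB file with data in the first and last kilobyte and
// returns the file size and the bytes read from the middle 2 KB.
func demo() (int64, []byte, error) {
	tmp, err := os.CreateTemp("", "sparse-*")
	if err != nil {
		return 0, nil, err
	}
	tmp.Close()
	defer os.Remove(tmp.Name())

	kb := bytes.Repeat([]byte{'x'}, 1024)
	if err := writeSparse(tmp.Name(), []chunk{{0, kb}, {3072, kb}}, 4096); err != nil {
		return 0, nil, err
	}
	data, err := os.ReadFile(tmp.Name())
	if err != nil {
		return 0, nil, err
	}
	return int64(len(data)), data[1024:3072], nil
}

func main() {
	size, middle, err := demo()
	if err != nil {
		log.Fatal(err)
	}
	zeros := bytes.Equal(middle, make([]byte, len(middle)))
	fmt.Println("size:", size, "middle reads as zeros:", zeros)
}
```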

Finally, it seems like handling this should not be the responsibility of every client of archive/tar. It seems like it would be better for this to just work automatically.

On the tar.Reader side, WriteTo already takes care of not writing to holes. It could also call SetSparse and use Truncate if present as an alternative to writing the last byte of the file.

On the tar.Writer side, I think ReadFrom could also take care of this. It would require making WriteHeader compute the header to be written to the file but delay the actual writing until the Write or ReadFrom call. (And that in turn might make Flush worth keeping around not-deprecated.) Then when ReadFrom is called to read from a file with holes, it could find the holes and add that information to the header before writing out the header. Both of those combined would make this actually automatic.
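A rough sketch of the deferred-header idea, written as a wrapper around today's tar.Writer rather than as a change to it. Everything named here is an assumption for illustration: lazyWriter, regioner, fakeFile, and demo are hypothetical, and the "EXAMPLE.regions" PAX record is only a placeholder to show that ReadFrom can still amend the header. It is not a real sparse encoding, and the body is still written in full.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"io"
	"strings"
)

// regioner is hypothetical: a source that can report its data/hole boundaries.
type regioner interface{ Regions() ([]int64, error) }

// lazyWriter (hypothetical) holds each header back until the first Write or
// ReadFrom, so that ReadFrom can still amend it.
type lazyWriter struct {
	tw      *tar.Writer
	pending *tar.Header
}

// WriteHeader records a copy of hdr instead of writing it immediately.
func (w *lazyWriter) WriteHeader(hdr *tar.Header) error {
	if err := w.flushHeader(); err != nil { // previous entry had no body
		return err
	}
	h := *hdr
	w.pending = &h
	return nil
}

// flushHeader writes the pending header, if any.
func (w *lazyWriter) flushHeader() error {
	if w.pending == nil {
		return nil
	}
	h := w.pending
	w.pending = nil
	return w.tw.WriteHeader(h)
}

func (w *lazyWriter) Write(p []byte) (int, error) {
	if err := w.flushHeader(); err != nil {
		return 0, err
	}
	return w.tw.Write(p)
}

// ReadFrom amends the pending header if the source reports regions, then
// writes the header followed by the data.
func (w *lazyWriter) ReadFrom(r io.Reader) (int64, error) {
	if rg, ok := r.(regioner); ok && w.pending != nil {
		regions, err := rg.Regions()
		if err != nil {
			return 0, err
		}
		// Placeholder record, NOT a real sparse encoding. For brevity it
		// replaces any caller-supplied PAXRecords.
		w.pending.PAXRecords = map[string]string{"EXAMPLE.regions": fmt.Sprint(regions)}
	}
	if err := w.flushHeader(); err != nil {
		return 0, err
	}
	return io.Copy(w.tw, r)
}

// fakeFile stands in for a file with a known hole layout.
type fakeFile struct {
	*strings.Reader
	regions []int64
}

func (f fakeFile) Regions() ([]int64, error) { return f.regions, nil }

// demo archives one entry and returns the record a tar.Reader sees.
func demo() (string, error) {
	body := strings.Repeat("x", 1024) + strings.Repeat("\x00", 2048) + strings.Repeat("x", 1024)
	src := fakeFile{strings.NewReader(body), []int64{0, 1024, 3072, 4096}}
	var buf bytes.Buffer
	w := &lazyWriter{tw: tar.NewWriter(&buf)}
	hdr := &tar.Header{Name: "sparse.bin", Typeflag: tar.TypeReg, Mode: 0644, Size: int64(len(body))}
	if err := w.WriteHeader(hdr); err != nil {
		return "", err
	}
	if _, err := w.ReadFrom(src); err != nil {
		return "", err
	}
	if err := w.tw.Close(); err != nil {
		return "", err
	}
	got, err := tar.NewReader(&buf).Next()
	if err != nil {
		return "", err
	}
	return got.PAXRecords["EXAMPLE.regions"], nil
}

func main() {
	fmt.Println(demo())
}
```

The point of the sketch is only the ordering: because nothing reaches the archive until the first Write or ReadFrom, the header can still be changed based on what ReadFrom learns about its source.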

At the very least, it seems clear that the current API steps beyond what tar should be responsible for. I can easily see developers who need to deal with sparse files but have no need for tar files constructing fake tar headers just to use DetectSparseHoles and PunchSparseHoles. That's a strong signal that this functionality does not belong in tar as the primary implementation. (A weaker but still important signal is that to date the tar.Header fields and methods have not mentioned os.File explicitly, and it should probably stay that way.)

Let's remove this from Go 1.10 and revisit in Go 1.11. Concretely, let's remove tar.SparseEntry, tar.Header.SparseHoles, tar.Header.DetectSparseHoles, tar.Header.PunchSparseHoles, and the deprecation notice for tar.Writer.Flush.

Thanks again.
Russ
