[filesystem] proposal: treat reparse files as regular files
Paul Harris
2015-07-24 08:03:17 UTC
Hi all,

-- Proposal --

tl;dr : I propose that we treat all non-symlink "reparse_files" as
"regular_files".

If the boost library user wants to do something special with these plain
reparse files, they should use alternative means. But typically they are
supposed to be treated as regular files.

This means we could drop the "reparse_file" enum, or continue to use it for
a special-case whats_my_real_status() function.


--- Motivation ---

Windows Server 2012 uses reparse points to implement deduplication.
Those files should be treated as regular files in all circumstances.
Currently, they are not classed as "regular" files, so fs::copy() will skip
those files,
and library-user code written to list files based on official examples will
ignore all dedup'd files.

This is causing serious and latent problems at the user end, because
deduping only happens occasionally after X days, and users cannot easily
check if a file is dedup'd (they look just like regular files).
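
For illustration, here is a minimal sketch of the kind of listing loop that
is at risk. It uses C++17 std::filesystem as a stand-in for boost::filesystem
(an assumption, made only so the snippet has no Boost dependency), and
list_regular_files is a hypothetical helper name. Any entry the library does
not class as "regular" is silently dropped by the filter, which is what would
happen to a dedup'd file reported as reparse_file.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;  // stand-in for boost::filesystem (assumption)

// Hypothetical helper: returns the names of the entries in `dir` that
// is_regular_file() accepts. Everything else is skipped without any notice.
inline std::vector<std::string> list_regular_files(const fs::path& dir) {
    assert(fs::is_directory(dir));  // precondition: dir must be a directory
    std::vector<std::string> names;
    for (const fs::directory_entry& entry : fs::directory_iterator(dir)) {
        // status() follows symlinks; the entry is kept only if the
        // resulting type is "regular".
        if (fs::is_regular_file(entry.status())) {
            names.push_back(entry.path().filename().string());
        }
    }
    return names;
}
```

Calling list_regular_files("D:/") on a volume with dedup'd files would, under
the current classification, return only the files that have not been
optimised yet.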


--- Real life example ---

Another example of reparse use is the "Symantec Enterprise Vault" (version
10), which I found running on one site.
It replaces files on the server with reparse-point files.
FSUTIL REPARSEPOINT QUERY filename.txt
shows the contents of the reparse buffer, which is a URL to an internal
HTTP server. The url points to a .asp link with a bunch of codes and dates
to identify the file in the server.
Copy-pasting that URL into a webbrowser allows you to directly download the
file via the webbrowser, which is pretty neat I suppose.

In this case, the reparsed-files in Windows Explorer all have grey X
crosses on their file icon. If you "type" them (via cmd) or open them, the
icon loses the grey cross and the file is no longer a reparse point file.

My software refused to read the files because they were "not regular
files". Once I adjusted the boost code (described below), my software saw
them as regular and opened the files. The file icons lost the grey cross.

So it seems that the file server automatically downloads and replaces the
files with the stored content on demand, and the file reading client
program should really just treat these files as normal files.


--- Short logic ---

reparse files (that are not symlinks) should almost always be treated as
plain files.
They are a mechanism for MS file servers to store files in clever ways, but
the client should not care and just read/write them as if they were normal
files.

This is different to all the other "other" files which can't be treated
like normal files:
block, character, fifo, socket, unknown

So, reparse files should not be grouped with the "other" file types.

They are also NOT symlinks, and should not be treated as symlinks (which
would require special decisions for copying, or querying the status, or
checking if the target still exists).


--- What are reparse files ---

I did some reading. If I understand correctly:

Reparse points give drivers (on the server) a chance to get data through
some other specialised means (eg query from a cluster store).
They are processed by the server, not the client, so clients should treat
reparse data as opaque data.
EXCEPT for symlink reparse files.

https://msdn.microsoft.com/en-us/library/dd541667.aspx

quote:"The following reparse tags, with the exception of
IO_REPARSE_TAG_SYMLINK, are processed on the server and are not processed
by a client after transmission over the wire. Clients should treat
associated reparse data as opaque data."

It seems like the rest of the tags are used for connecting files to other
types of storage (eg long term storage, cluster storage).
Clients may need to do something special with SOME reparse point files, IF
the client cares about how long the file read may take.
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365505(v=vs.85).aspx
quote: "Most applications should take special actions for files that have
been moved to long-term storage, if only to notify the user that it may
take a while to retrieve the file."


--- Changes required ---

Option 1: change is_regular_file() to return true where type==reparse_file
I don't like this option, as library-users could be checking the type
directly instead of using is_regular_file().


Option 2:
These functions return reparse_file:

fs::file_type query_file_type(const path& p, error_code* ec)
file_status status(const path& p, error_code* ec)
file_status symlink_status(const path& p, error_code* ec)

They should return regular_file instead.
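
A minimal sketch of the Option 2 decision, written as a pure function so it
can be read without the Windows headers. The constant values are the ones
published in the Windows SDK; file_kind and classify are hypothetical names
and not part of Boost.Filesystem. The sketch assumes that a non-symlink
reparse point simply falls through to the ordinary directory/regular check.

```cpp
#include <cassert>
#include <cstdint>

// Values as published in the Windows SDK headers.
constexpr std::uint32_t kFileAttributeDirectory    = 0x00000010;
constexpr std::uint32_t kFileAttributeReparsePoint = 0x00000400;
constexpr std::uint32_t kReparseTagSymlink         = 0xA000000C;
constexpr std::uint32_t kReparseTagDedup           = 0x80000013;

// Hypothetical stand-in for fs::file_type, reduced to what the sketch needs.
enum class file_kind { regular, directory, symlink };

// Hypothetical helper: what status-like functions would report under
// Option 2, given a file's attributes and its reparse tag (0 if none).
constexpr file_kind classify(std::uint32_t attributes, std::uint32_t reparse_tag) {
    const bool is_reparse = (attributes & kFileAttributeReparsePoint) != 0;
    if (is_reparse && reparse_tag == kReparseTagSymlink) {
        return file_kind::symlink;  // true symlinks keep their own type
    }
    if ((attributes & kFileAttributeDirectory) != 0) {
        return file_kind::directory;
    }
    return file_kind::regular;  // includes non-symlink reparse files
}

// A dedup'd file is reported as regular; a symlink is still a symlink.
static_assert(classify(kFileAttributeReparsePoint, kReparseTagDedup) == file_kind::regular,
              "dedup reparse file should classify as regular");
static_assert(classify(kFileAttributeReparsePoint, kReparseTagSymlink) == file_kind::symlink,
              "symlink reparse file should classify as symlink");
```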


--- How to test with dedup files ---

Creating dedup'd files is a feature only available on Windows Server 2012,
I believe,
although Windows XP/Vista/7/8/10 clients all can read dedup files.

Here is how I created a windows server to test with (for free!) on a demo
Azure cloud server.
I have one working, so if anyone would like to use it for their testing,
let me know.

Step one: follow this blog article:
http://blogs.technet.com/b/tommypatterson/p/azureservertrial.aspx

once the machine was "running" I clicked Connect at the bottom.
That gave me an .rdp file which in theory I could use with rdesktop, but it
uses a DNS name that was only just created, so that didn't work.

When you click the name of the server in the list, it shows the public IP
and the port on the right.
Then you can do this:
$ rdesktop that.ip.addr:port

But only if you have the latest rdesktop AND you have set up kerberos
something-something.

Instead I found a windows computer and used remote desktop from there.


---

Once inside,
in the "Server Manager --> Dashboard" window on the screen, click "Add
Roles"
then go next next until "Server Roles"
expand "File and Storage services" , "File and iSCSI" , and tick "Data
Deduplication"
Then next next etc and Install.
Wait a bit... and it's done.
http://www.techrepublic.com/blog/data-center/configuring-windows-server-8-deduplication/

---

Continuing on that webpage...
Time to enable dedup. There is a temp disk D: so let's enable it there.

Method 1... I did this and then went to method 2... Start PowerShell, type:
"Enable-DedupVolume D:"

Method 2... in that same Dashboard, hit the 4th button (File and Storage
Services)
Then Volumes --> Disks
click Volume 1 at the top, and then right click D: at the bottom -->
Configure Dedup.

To try and accelerate this puppy, I set the "age to dedup" to 0 days.

http://www.techrepublic.com/blog/data-center/windows-server-2012-deduplication-how-and-where-to-tweak/

---

Time to make something to dedup. We'll just duplicate the warning.txt file
that exists on D:

In powershell:
PS> D:
PS> $file = Get-Content DATALOSS_WARNING_README.txt

Then, do these 2 commands a bunch of times until "big.txt" gets to say 6MB
PS> Add-Content big.txt $file
PS> $file = Get-Content big.txt

Then use windows explorer (or other) to make a dozen copies of big.txt


Copy c:\windows\explorer.exe to D:
to give it something to dedup
Go to D: and then copy-paste explorer.exe a dozen times.

In PowerShell, type:
PS> Update-DedupStatus -Volume D:
PS> Start-DedupJob -Type Optimization -Volume D:

and then wait for it to finish.
you can track its progress with:
PS> Get-DedupJob
PS> Get-DedupStatus -Volume D:

---

So, once it's deduped, you can check:
PS> FSUTIL REPARSEPOINT QUERY big.txt
you should see that it's a reparse point with that 0x800etc0013 code.

Copy-paste big.txt to big2.txt and check it with the query, and it should
tell you big2 is NOT a reparse point.


NOW you have some files to test the boost library...
You can't zip them up (they lose the dedup tag), you have to run boost
binaries ON the computer in the sky.


--- Finish ---

Thanks for reading,
Paul

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Niall Douglas
2015-07-24 16:56:49 UTC
On 24 Jul 2015 at 16:03, Paul Harris wrote:

> tl;dr : I propose that we treat all non-symlink "reparse_files" as
> "regular_files".

I appreciate all the detail, and I'm sure Beman, who is Filesystem's
maintainer, does too.

However, they all still look like symlinks to me. Just because the OS
magically replaces them with the real file on first access is
immaterial - the same thing could happen on Linux. If you don't treat
them as symlinks, there is no way of inspecting the link without
causing it to be auto-downloaded which could be catastrophic in some
use cases.

I still vote for pseudo-symlinks to be reported by Filesystem as
symlinks.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Paul Harris
2015-07-27 02:55:35 UTC
On 25 July 2015 at 00:56, Niall Douglas <***@nedprod.com> wrote:

> I appreciate all the detail, and I'm sure Beman, who is Filesystem's
> maintainer, does too.
>
> However, they all still look like symlinks to me. Just because the OS
> magically replaces them with the real file on first access is
> immaterial - the same thing could happen on Linux. If you don't treat
> them as symlinks, there is no way of inspecting the link without
> causing it to be auto-downloaded which could be catastrophic in some
> use cases.
>
> I still vote for pseudo-symlinks to be reported by Filesystem as
> symlinks.
>


I did think about that, but the design of these reparse points intends for
these files to be treated as plain files by the client - as per MS
documents.

Plus, I understand it as: the reparse buffer is entirely driver-specific,
and so you can't expect boost or any user program to be able to decode what
is inside the reparse buffer and do anything intelligent. AND the
resolving is done by the driver on the server side. Note that there are
probably a dozen products out there that use these reparse buffers for
their storage solution... it's not just Windows dedup.

So, I don't see how the client can do anything intelligent with symlink
knowledge,
AND if boost library users are forced to treat them as symlinks, then you
now have 2 kinds of symlinks:

1) standard symlink, which you really want a shallow copy sometimes, and
you have to be careful of loops ( A -> B -> A )

2) reparse (but not symlink), which you cannot shallow-copy (as far as I
understand), and loops are not possible.

So I've already seen:

* My software doesn't want to follow links, but now the new version will
force me to specifically check if it's just a reparse-file and then follow.

* Whole-disk backup software doesn't follow symlinks because it assumes
it'll get the real file later. Reparse (nonsymlink) files do not have
any other "real file", so those files are not being backed up at all right
now.

So treating them as symlinks causes more trouble than it's worth for the
one edge case it helps.

reparse-files-non-symlink is such a specialised case, I'd personally want a
specialised get_reparse_info kind of function, so if I really need to care,
then I can find that information.

Your thoughts?
Cheers,
Paul

Niall Douglas
2015-07-27 11:30:55 UTC
On 27 Jul 2015 at 10:55, Paul Harris wrote:

> > However, they all still look like symlinks to me. Just because the OS
> > magically replaces them with the real file on first access is
> > immaterial - the same thing could happen on Linux. If you don't treat
> > them as symlinks, there is no way of inspecting the link without
> > causing it to be auto-downloaded which could be catastrophic in some
> > use cases.
> >
> > I still vote for pseudo-symlinks to be reported by Filesystem as
> > symlinks.
> >
>
>
> I did think about that, but the design of these reparse points intends for
> these files to be treated as plain files by the client - as per MS
> documents.

This is like saying that POSIX symlinks are intended to be treated as
their target, which is the whole point of using them.

Reparse points are the *technology* by which Microsoft implemented
symlinks in NTFS. They offer a *family* of symlink implementations,
all with varying semantics. Some of that family bear strong
resemblance to the much more limited POSIX symlink, others are quite
different.

If you weren't on NTFS, the technology used to implement symlinks is
different. For example, the NT kernel provides its own non-persistent
symlink implementation totally separate from NTFS.

> Plus, I understand it as: the reparse buffer is entirely driver-specific,
> and so you can't expect boost or any user program to be able to decode what
> is inside the reparse buffer and do anything intelligent.

Microsoft have published the structure for their reparse tag formats.
Anyone can parse that structure (AFIO does).
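
As a rough sketch of what parsing that structure involves: every reparse
buffer begins with the same 8-byte header (a 32-bit tag, a 16-bit data
length, and 16 reserved bits, little-endian), and the tag itself carries flag
bits matching the IsReparseTagMicrosoft and IsReparseTagNameSurrogate macros
in the Windows SDK. The names reparse_header and parse_reparse_header below
are hypothetical and are not taken from AFIO or Boost; obtaining the raw
bytes from the OS is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Fields common to every reparse buffer header.
struct reparse_header {
    std::uint32_t tag;          // e.g. 0xA000000C (symlink), 0x80000013 (dedup)
    std::uint16_t data_length;  // size of the tag-specific data that follows
};

// Flag bits inside the tag, as used by the Windows SDK macros.
constexpr std::uint32_t kMicrosoftTagBit  = 0x80000000;
constexpr std::uint32_t kNameSurrogateBit = 0x20000000;

constexpr bool is_microsoft_tag(std::uint32_t tag) {
    return (tag & kMicrosoftTagBit) != 0;
}

// True when the tag marks the file as standing in for another named entity.
constexpr bool is_name_surrogate(std::uint32_t tag) {
    return (tag & kNameSurrogateBit) != 0;
}

// Hypothetical helper: decode the 8-byte little-endian header from raw
// bytes. Returns false if the buffer is missing or too short.
inline bool parse_reparse_header(const unsigned char* bytes, std::size_t size,
                                 reparse_header& out) {
    if (bytes == nullptr || size < 8) {
        return false;
    }
    out.tag = static_cast<std::uint32_t>(bytes[0])
            | static_cast<std::uint32_t>(bytes[1]) << 8
            | static_cast<std::uint32_t>(bytes[2]) << 16
            | static_cast<std::uint32_t>(bytes[3]) << 24;
    out.data_length = static_cast<std::uint16_t>(bytes[4] | (bytes[5] << 8));
    return true;  // bytes[6..7] are reserved and ignored here
}

// The symlink tag has the name-surrogate bit set; the dedup tag does not.
static_assert(is_name_surrogate(0xA000000Cu), "symlink tag is a name surrogate");
static_assert(!is_name_surrogate(0x80000013u), "dedup tag is not a name surrogate");
```

Only the header is decoded here; the layout of the data after it depends on
the tag.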

> AND the
> resolving is done by the driver on the server side. Note that there are
> probably a dozen products out there that use these reparse buffers for
> their storage solution... it's not just Windows dedup.

The resolution varies actually. For example junction points are
resolved server side, symlinks are resolved client side.

> So, I don't see how the client can do anything intelligent with symlink
> knowledge,
> AND if boost library users are forced to treat them as symlinks, then you
> now have 2 kinds of symlinks:
>
> 1) standard symlink, which you really want a shallow copy sometimes, and
> you have to be careful of loops ( A -> B -> A )
>
> 2) reparse (but not symlink), which you cannot shallow-copy (as far as I
> understand), and loops are not possible.

You can copy the standard Microsoft reparse points as those are
documented.

I see no reason why Filesystem's read_symlink(), create_symlink() and
copy_symlink() wouldn't all work just fine if upgraded to understand
more reparse point types.

> * My software doesn't want to follow links, but now the new version will
> force me to specifically check if it's just a reparse-file and then follow.

No, that depends on whatever the OS does with the symlink. Ordinarily
I would assume it dereferences the link unless you specifically ask
for it not to, same as on POSIX, i.e. if you lstat() it, it returns
the stat for the symlink; if you stat() it, it returns the stat for
the target.
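
The stat()/lstat() distinction maps onto status() and symlink_status() in
the Filesystem API. A small sketch, using C++17 std::filesystem as a stand-in
for boost::filesystem (an assumption), with hypothetical names link_report
and probe_symlink. Note that creating a symlink on Windows may need extra
privileges, so this is easiest to try on a POSIX system.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;  // stand-in for boost::filesystem (assumption)

// What the two status queries report for the same symlink path.
struct link_report {
    fs::file_type followed;      // status(): like stat(), follows the link
    fs::file_type not_followed;  // symlink_status(): like lstat()
};

// Hypothetical helper: creates a regular file plus a symlink to it in a
// scratch directory, queries the symlink both ways, then cleans up.
inline link_report probe_symlink() {
    const fs::path dir = fs::temp_directory_path() / "reparse_thread_symlink_demo";
    fs::remove_all(dir);
    fs::create_directories(dir);

    const fs::path target = dir / "target.txt";
    const fs::path link = dir / "link.txt";
    {
        std::ofstream out(target);
        out << "hello\n";
    }
    fs::create_symlink(target, link);

    const fs::file_type followed = fs::status(link).type();
    const fs::file_type not_followed = fs::symlink_status(link).type();

    fs::remove_all(dir);
    return link_report{followed, not_followed};
}
```

On a POSIX system the first field should come back as file_type::regular and
the second as file_type::symlink.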

> * Whole-disk backup software doesn't follow symlinks because it assumes
> it'll get the real file later. Reparse (nonsymlink) files do not have
> any other "real file", so those files are not being backed up at all right
> now.
>
> So treating them as symlinks causes more trouble than it's worth for the
> one edge case it helps.
>
> reparse-files-non-symlink is such a specialised case, I'd personally want a
> specialised get_reparse_info kind of function, so if I really need to care,
> then I can find that information.
>
> Your thoughts?

I think Filesystem should provide what POSIX provides. Where Windows
provides close enough to POSIX behaviours we should support that too.

However, pages of special Windows support aren't what Boost usually
does. We're here to abstract out the commonalities, generally
speaking.

I agree Filesystem (and AFIO) should recognise deduped files as
something valid that can be worked with. Anything past that is up to
the end user.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Paul Harris
2015-07-28 01:33:25 UTC
I think we are not on the same page. Let me try and refocus the
discussion...

With symlinks, there is more than one access point to the same file
content. (ie multiple file names to the identical content).

That makes symlinks fundamentally different to regular files. And it's why
they are treated differently. Eg don't back up content twice.

Is that statement correct?

Reparse point files (that are not junctions or symlinks) do not have an
alternate access point through the file system.

You cannot access the underlying data via another file name. Eg dedup
files.

Is that also correct?

Cheers,
Paul

_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Niall Douglas
2015-07-28 10:36:18 UTC
Permalink
On 28 Jul 2015 at 9:33, Paul Harris wrote:

> I think we are not on the same page. Let me try and refocus the
> discussion...
>
> With symlinks, there is more than one access point to the same file
> content. (ie multiple file names to the identical content).
>
> That makes symlinks fundamentally different to regular files. And it's why
> they are treated differently. Eg don't back up content twice.
>
> Is that statement correct?

No.

Symlinks are small text files consisting of the path to indirect to.
You can open them and modify them whether on POSIX or Windows.

For most OS filesystem APIs, the OS spots symlink files and magically
does the indirection for you.

> Reparse point files (that are not junctions or symlinks) do not have an
> alternate access point through the file system.
>
> You cannot access the underlying data via another file name. Eg dedup
> files.
>
> Is that also correct?

Reparse points are just like POSIX symlink files - they are small
files containing the path of where to indirect to.

They are not special in any way, except by triggering exceptional
behaviour in most OS APIs.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Andrey Semashev
2015-07-28 11:07:23 UTC
Permalink
On 28.07.2015 04:33, Paul Harris wrote:
> I think we are not on the same page. Let me try and refocus the
> discussion...
>
> With symlinks, there is more than one access point to the same file
> content. (ie multiple file names to the identical content).
>
> That makes symlinks fundamentally different to regular files. And it's why
> they are treated differently. Eg don't back up content twice.
>
> Is that statement correct?

As Niall already commented, that's not correct. What you described is
more like a hardlink [1].

You can easily spot the difference if you rename or delete the file the
link points to. The symlink will keep pointing to the old file (thus
being a dangling symlink) while the hardlink will still be pointing to
the file content.

A hardlink is actually not any more special than a regular file. Put
simply, from the filesystem perspective any file is a name pointing to
the content. When you create a new file, there's only one such name.
When you create a hardlink, you create another name pointing to the same
content and increment the reference count to the content. The two names
are equivalent, and the content exists as long as there are names
referencing it.

[1] https://en.wikipedia.org/wiki/Hard_link
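
The rename/delete experiment described above can be sketched as follows. This is an illustration under assumptions, not part of any library: it uses std::filesystem as a stand-in for Boost.Filesystem, the names LinkObservation and observe_links are invented, and it expects to be given a fresh directory on a filesystem that supports both hard links and symlinks.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Hypothetical record of what we see before and after deleting the original name.
struct LinkObservation {
    std::uintmax_t names_before_delete;   // hard link count while both names exist
    bool hardlink_readable_after_delete;  // content still reachable via the second name
    bool symlink_still_a_symlink;         // the link entry itself survives
    bool symlink_target_exists;           // false once the symlink dangles
};

LinkObservation observe_links(const fs::path& dir) {
    fs::create_directories(dir);
    const fs::path original = dir / "original.txt";
    const fs::path hard = dir / "hard.txt";
    const fs::path soft = dir / "soft.txt";
    {
        std::ofstream out(original);
        out << "payload";  // 7 bytes of content
    }
    fs::create_hard_link(original, hard);  // second name for the same content
    fs::create_symlink(original, soft);    // small entry holding the path of `original`

    LinkObservation obs{};
    obs.names_before_delete = fs::hard_link_count(original);
    fs::remove(original);  // drop the first name only
    obs.hardlink_readable_after_delete =
        fs::is_regular_file(hard) && fs::file_size(hard) == 7;
    obs.symlink_still_a_symlink = fs::is_symlink(soft);  // inspects the link itself
    obs.symlink_target_exists = fs::exists(soft);        // follows the link
    return obs;
}
```

After the original name is removed, the hard link still reaches the content, while the symlink remains as an entry whose target no longer exists.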


Paul Harris
2015-07-28 12:40:53 UTC
Permalink
On 28 July 2015 at 19:07, Andrey Semashev <***@gmail.com> wrote:

> On 28.07.2015 04:33, Paul Harris wrote:
>
>> I think we are not on the same page. Let me try and refocus the
>> discussion...
>>
>> With symlinks, there is more than one access point to the same file
>> content. (ie multiple file names to the identical content).
>>
>> That makes symlinks fundamentally different to regular files. And it's why
>> they are treated differently. Eg don't back up content twice.
>>
>> Is that statement correct?
>>
>
> As Niall already commented, that's not correct. What you described is more
> like a hardlink [1].
>
> You can easily spot the difference if you rename or delete the file the
> link points to. The symlink will keep pointing to the old file (thus being
> a dangling symlink) while the hardlink will still be pointing to the file
> content.
>
> A hardlink is actually not any more special than a regular file. Put
> simply, from the filesystem perspective any file is a name pointing to the
> content. When you create a new file, there's only one such name. When you
> create a hardlink, you create another name pointing to the same content and
> increment the reference count to the content. The two names are equivalent,
> and the content exists as long as there are names referencing it.
>
> [1] https://en.wikipedia.org/wiki/Hard_link
>
>
>
I think my point is being missed... I am not debating symlinks or
hardlinks...

I am _happy_ with the way hardlinks and symlinks are treated, in both posix
and windows.

I am _happy_ with the way reparse-based-symlinks and junctions are treated
in windows.

I _disagree_ with the way dedup'd files are currently treated as a
special file (as if they were a device or a character file or a fifo or a
socket). Devices, sockets and fifos all need to be read in a special way, but
dedup'd files should be read as if they were plain files.

I _disagree_ that a dedup file should be treated as if it were a symlink.
This is because a dedup file does not point to another file (or inode) on
the file system, which is a characteristic of a symlink or a hardlink. It
is basically just a compressed file. We don't treat NTFS-compressed files
differently from regular files, so why are we treating dedup'd files
differently?


Dedup files and symlink files on windows both (unfortunately) use the same
mechanism - reparse points. But we should only treat symlink and junction
reparse point files as symlinks. Anything else should be treated as a
regular file. That is how I am reading the MS docs, and that is how I am
experiencing working with the filesystems.


Simple example is when building a backup program for files
in a _single directory_.

Let's say you want to store every file's content once.
When you find a directory, ignore it.
When you find an "other" file, ignore it (how can you backup a device /
character file / etc?)
When you find a symlink, you want to store just the link.
When you find a regular file, you want to store the contents.
When you find a reparse-point-symlink, you want to store just the link
(like a posix symlink).
When you find a dedup'd file, you want to store the contents (like a posix
regular file).

for (directory_iterator ...)
{
    if (is_symlink(fn)) backup_link(fn);
    if (is_regular_file(fn)) backup_contents(fn);
    if (is_directory(fn)) ignore(fn);
    if (is_other(fn)) ignore(fn);
}

Currently, this pseudo code would fail to back up any automatically dedup'd
files (which are basically any file older than 3 days on some of my sites).
It fails because a dedup'd file is currently an "other".

If you treat a dedup'd file as a symlink, only the "link" will be backed up.
This link points to a magical place that is impossible to read other than
simply reading "fn".
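
A compilable version of the loop above might look like the sketch below. It is written against std::filesystem rather than Boost.Filesystem so that it stands alone, and the names Action, backup_action and plan_backup are invented for illustration. It classifies each entry from symlink_status() so that links are recorded rather than followed. How a dedup'd file is reported by the library is exactly the open question in this thread, so the sketch makes no claim about that.

```cpp
#include <filesystem>
#include <string>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

enum class Action { store_link, store_contents, ignore };

// Decide what to do from the entry's own status (symlinks are not followed).
Action backup_action(const fs::file_status& st) {
    if (fs::is_symlink(st)) return Action::store_link;
    if (fs::is_regular_file(st)) return Action::store_contents;
    return Action::ignore;  // directories, devices, fifos, sockets, unknown types
}

// Build a backup plan for the entries of a single directory (non-recursive).
std::vector<std::pair<std::string, Action>> plan_backup(const fs::path& dir) {
    std::vector<std::pair<std::string, Action>> plan;
    for (const fs::directory_entry& entry : fs::directory_iterator(dir)) {
        plan.emplace_back(entry.path().filename().string(),
                          backup_action(entry.symlink_status()));
    }
    return plan;
}
```

Any entry type that lands in the final branch is silently skipped, which is the failure mode described above when a readable file is classified as "other".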

So how does this simple program back up the dedup'd file contents?

cheers,
Paul

Beman Dawes
2015-07-28 13:59:30 UTC
Permalink
I am watching this thread closely. But I'm traveling until next week so
won't comment on technical issues until then.

--Beman

Paul Harris
2015-08-14 06:01:25 UTC
Permalink
Hi Beman,

Have you had time to consider the way forward for boost::filesystem?

cheers,
Paul


On 28 July 2015 at 21:59, Beman Dawes <***@acm.org> wrote:

> I am watching this thread closely. But I'm traveling until next week so
> won't comment on technical issues until then.
>
> --Beman
>

Niall Douglas
2015-07-29 02:06:03 UTC
Permalink
On 28 Jul 2015 at 20:40, Paul Harris wrote:

> I _disagree_ with the way dedup'd files are currently treated as a
> special file (as if they were a device or a character file or a fifo or a
> socket). device/socket/fifos all need to be read in a special way, but
> dedup'd files should be read as if they were a plain file.
>
> I _disagree_ that a dedup file should be treated as if it were a symlink.
> This is because a dedup file does not point to another file (or inode) on
> the file system, which is a characteristic of a symlink or a hardlink. It
> is basically just a compressed file. We don't treat NTFS-compressed files
> differently from regular files, why are we treating dedup'd files
> differently?

NTFS compressed files act exactly like normal files. Reparse point
files do not and require significant additional processing to figure
out what kind they are. That's the difference.

From AFIO's perspective, when it does NtQueryDirectoryFile() to fetch
metadata about a file entry, it can zero cost learn if an entry is a
reparse point by examining FileAttributes for the
FILE_ATTRIBUTE_REPARSE_POINT flag. It cannot tell what kind of
reparse point file it is without opening the file and asking.

Windows' CreateFile() API is astonishingly slow. To require calling
that, then an additional NtQueryDirectoryFile() to fetch the
FILE_REPARSE_POINT_INFORMATION metadata and close the handle - which
is the fastest way I know of to fetch the reparse point tag code -
would impose an enormous performance penalty for all file entries
marked with FILE_ATTRIBUTE_REPARSE_POINT.

I appreciate you're saying the cost is worth it, but we're thinking
of all Boost users here, not just the small minority on Windows Server
2012 with dedup turned on.

> for (directory_iterator ...)
> {
> if (is_symlink(fn)) backup_link(fn);
> if (is_regular_file(fn)) backup_contents(fn);
> if (is_directory(fn)) ignore(fn);
> if (is_other(fn)) ignore(fn);
> }
>
> Currently, this pseudo code would fail to backup any automatic dedup'd
> files (which are basically any file older than 3 days on some of my sites).
> It fails because a dedup'd file is currently an "other".
>
> If you treat a dedup'd file as a symlink, only the "link" will be backed up.
> This link points to a magical place that is impossible to read other than
> simply reading "fn".
>
> So how does this simple program backup the dedup'd file contents?

I appreciate the problem with saying something is a symlink, but
trying to retrieve the target of that symlink has to error out
because it's meaningless in the case of a dedup symlink.

What seems to me the best route forward is you do something like
this:

if (is_symlink(fn))
{
    error_code ec;
    auto target=read_symlink(fn, ec);
    if(!ec)
        backup_link(fn);
}

Because is_regular_file() and is_directory() use status(), they
follow any symlink so you can safely fall through to those.
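
Spelled out as a complete function, the suggested fall-through could look roughly like this. It is only a sketch: it uses std::filesystem in place of Boost.Filesystem, the names Action, Decision and decide are made up, and it assumes that read_symlink() reports an error for a link-like entry whose target cannot be retrieved. On POSIX a dangling symlink still has a readable target, so it would be stored as a link.

```cpp
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

enum class Action { store_link, store_contents, ignore };

// Hypothetical result type: the chosen action plus the link target, if any.
struct Decision {
    Action action;
    fs::path link_target;
};

Decision decide(const fs::path& p) {
    std::error_code ec;
    if (fs::is_symlink(p, ec)) {  // inspects the entry itself
        fs::path target = fs::read_symlink(p, ec);
        if (!ec) return {Action::store_link, target};
        // Target could not be read: fall through and treat it by what it resolves to.
    }
    if (fs::is_regular_file(p, ec)) {  // uses status(), so it follows links
        return {Action::store_contents, {}};
    }
    return {Action::ignore, {}};
}
```

Ordinary symlinks still return early with store_link here. Only entries whose target cannot be read reach the status()-based checks.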

Is this acceptable to you? If so, I'll update AFIO accordingly to
match these new semantics and add a note to the docs. I'm sure Beman
will consider something similar when he is less busy.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Paul Harris
2015-07-29 04:27:45 UTC
Permalink
On 29 July 2015 at 10:06, Niall Douglas <***@nedprod.com> wrote:

> On 28 Jul 2015 at 20:40, Paul Harris wrote:
>
> > I _disagree_ with the way dedup'd files are currently treated as a
> > special file (as if they were a device or a character file or a fifo or a
> > socket). device/socket/fifos all need to be read in a special way, but
> > dedup'd files should be read as if they were a plain file.
> >
> > I _disagree_ that a dedup file should be treated as if it were a symlink.
> > This is because a dedup file does not point to another file (or inode) on
> > the file system, which is a characteristic of a symlink or a hardlink. It
> > is basically just a compressed file. We don't treat NTFS-compressed files
> > differently from regular files, why are we treating dedup'd files
> > differently?
>
> NTFS compressed files act exactly like normal files. Reparse point
> files do not and require significant additional processing to figure
> out what kind they are. That's the difference.
>

You only need to process symlink-reparse-point-files.
Dedup reparse point files can be treated the same as a normal file.


>
> From AFIO's perspective, when it does NtQueryDirectoryFile() to fetch
> metadata about a file entry, it can zero cost learn if an entry is a
> reparse point by examining FileAttributes for the
> FILE_ATTRIBUTE_REPARSE_POINT flag. It cannot tell what kind of
> reparse point file it is without opening the file and asking.
>
> Windows' CreateFile() API is astonishingly slow. To require calling
> that, then an additional NtQueryDirectoryFile() to fetch the
> FILE_REPARSE_POINT_INFORMATION metadata and close the handle - which
> is the fastest way I know of to fetch the reparse point tag code -
> would impose an enormous performance penalty for all file entries
> marked with FILE_ATTRIBUTE_REPARSE_POINT.
>
>
I have no comment on performance. I want things to work.


> I appreciate you're saying the cost is worth it, but we're thinking
> all Boost users here, not just the small minority on Windows Server
> 2012 with dedup turned on.
>

You don't seem to understand that this affects ANY Windows client that talks
to a Windows 2012 dedup-enabled server.

Which, as of last month, has gone from zero to 5 different companies in
my world. Seems that all the IT departments are upgrading after the end-of-
financial-year.

So, a Windows 7 user will be accessing dedup files.


> > for (directory_iterator ...)
> > {
> > if (is_symlink(fn)) backup_link(fn);
> > if (is_regular_file(fn)) backup_contents(fn);
> > if (is_directory(fn)) ignore(fn);
> > if (is_other(fn)) ignore(fn);
> > }
> >
> > Currently, this pseudo code would fail to backup any automatic dedup'd
> > files (which are basically any file older than 3 days on some of my sites).
> > It fails because a dedup'd file is currently an "other".
> >
> > If you treat a dedup'd file as a symlink, only the "link" will be backed up.
> > This link points to a magical place that is impossible to read other than
> > simply reading "fn".
> >
> > So how does this simple program backup the dedup'd file contents?
>
> I appreciate the problem with saying something is a symlink, but
> trying to retrieve the target of that symlink has to error out
> because it's meaningless in the case of a dedup symlink.
>

Please stop calling it "dedup symlink". It is _not_ any kind of symlink.
That is the point of misunderstanding, we are not on the same page.


>
> What seems to me the best route forward is you do something like
> this:
>
> if (is_symlink(fn))
> {
> error_code ec;
> auto target=read_symlink(fn, ec);
> if(!ec)
> backup_link(fn);
> }
>
> Because is_regular_file() and is_directory() use status(), they
> follow any symlink so you can safely fall through to those.
>
>
This is unacceptable, because I do not want to follow symlinks.
That was specified in the example.

Let's be more specific about the example directory to backup.

On Monday, it contains:
FILE_A (a plain file)
FILE_B (a symlink to FILE_A)
FILE_C (a plain copy of FILE_A)

Backup should store this:
FILE_A contents. FILE_B link. FILE_C contents.

On Tuesday, dedup/archival has run on the server. Directory now contains:
FILE_A (a dedup file)
FILE_B (a symlink to FILE_A)
FILE_C (a dedup file)

Backup SHOULD store this:
FILE_A contents. FILE_B link. FILE_C contents.


IF you treat dedup=symlink, then the example will instead store:
FILE_A link. FILE_B link. FILE_C link.
(although I have no idea what "FILE_A link" will actually read)

If you follow symlinks, then backup stores the wrong thing:
FILE_A contents. FILE_B contents (WRONG). FILE_C contents.

If you treat dedup files as regular files, then backup stores correctly:
FILE_A contents. FILE_B link. FILE_C contents.
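
The three outcomes above can be checked with a tiny model. Nothing here touches a real filesystem: Kind, Policy, Stored, stored_for and backup are invented names that simply encode the three treatments being compared.

```cpp
#include <vector>

enum class Kind { plain, symlink, dedup };
enum class Policy { dedup_as_symlink, follow_symlinks, dedup_as_regular };
enum class Stored { contents, link };

// What a backup stores for one entry under a given treatment of dedup files.
Stored stored_for(Kind kind, Policy policy) {
    switch (policy) {
    case Policy::follow_symlinks:
        return Stored::contents;  // everything is resolved to its content
    case Policy::dedup_as_symlink:
        return kind == Kind::plain ? Stored::contents : Stored::link;
    case Policy::dedup_as_regular:
        return kind == Kind::symlink ? Stored::link : Stored::contents;
    }
    return Stored::contents;  // not reached; keeps compilers quiet
}

// Apply the policy to every entry of a directory listing, in order.
std::vector<Stored> backup(const std::vector<Kind>& directory, Policy policy) {
    std::vector<Stored> result;
    for (Kind kind : directory) result.push_back(stored_for(kind, policy));
    return result;
}
```

With Tuesday's directory modelled as {dedup, symlink, dedup}, only Policy::dedup_as_regular yields contents, link, contents.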


cheers,
Paul

Niall Douglas
2015-07-29 11:12:07 UTC
Permalink
On 29 Jul 2015 at 12:27, Paul Harris wrote:

> > I appreciate the problem with saying something is a symlink, but
> > trying to retrieve the target of that symlink has to error out
> > because it's meaningless in the case of a dedup symlink.
>
> Please stop calling it "dedup symlink". It is _not_ any kind of symlink.
> That is the point of misunderstanding, we are not on the same page.

It *is* a kind of symlink.

Deduped files on NTFS are kept as a chain of compressed fragments.
When you open the file handle, all that has to be decompressed and
rechained back together into a temporary inode.

This is why deduped files are so publicly marked: they are
much more expensive to open than regular files. I suspect that's why
CIFS exports the flag instead of treating the file as a proper
regular file: client programs ought to know that this isn't one.

Anyway, thanks to Gavin I have a solution for AFIO which is optimal,
so I'll commit that shortly - these deduped files are going to get a
special flag, not least because handle::path() is going to return
something weird for the open file handle.

Beman has a trickier problem on his hands - he can either add a
special type of flag for these files and then the OP's code falls
through to is_regular_file and he's happy. Or he can filter out the
symlink flag when the reparse tag is a dedup, and always return a
regular file instead. I don't know which is better for Filesystem.

Thanks to Gavin for spotting that the reserved field in
WIN32_FIND_DATA is officially the reparse tag type!

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Gavin Lambert
2015-07-29 06:09:49 UTC
Permalink
On 29/07/2015 14:06, Niall Douglas wrote:
> NTFS compressed files act exactly like normal files. Reparse point
> files do not and require significant additional processing to figure
> out what kind they are. That's the difference.
>
> From AFIO's perspective, when it does NtQueryDirectoryFile() to fetch
> metadata about a file entry, it can zero cost learn if an entry is a
> reparse point by examining FileAttributes for the
> FILE_ATTRIBUTE_REPARSE_POINT flag. It cannot tell what kind of
> reparse point file it is without opening the file and asking.
>
> Windows' CreateFile() API is astonishingly slow. To require calling
> that, then an additional NtQueryDirectoryFile() to fetch the
> FILE_REPARSE_POINT_INFORMATION metadata and close the handle - which
> is the fastest way I know of to fetch the reparse point tag code -
> would impose an enormous performance penalty for all file entries
> marked with FILE_ATTRIBUTE_REPARSE_POINT.

If it helps,
https://msdn.microsoft.com/en-us/library/windows/desktop/aa365511.aspx
seems to specify that reparse points provide their tag id in the
dwReserved0 field of the WIN32_FIND_DATA structure (I'm not sure how
that maps to the native API, but I assume it's somewhere). That should
be sufficient to identify the reparse point type.

(https://msdn.microsoft.com/en-us/library/windows/desktop/aa365740.aspx
backs this up, incidentally.)
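
As a rough sketch of that idea: the pure helper below takes the two WIN32_FIND_DATA fields and returns the tag only when the reparse attribute is set. The Windows-only wrapper around FindFirstFileW is guarded by _WIN32 and is untested here. The names reparse_tag_from_find_data and reparse_tag_of are made up, and the attribute constant is written out by hand to mirror FILE_ATTRIBUTE_REPARSE_POINT.

```cpp
#include <cstdint>
#include <optional>
#include <string>

#ifdef _WIN32
#include <windows.h>
#endif

// Mirrors FILE_ATTRIBUTE_REPARSE_POINT from the Windows SDK.
constexpr std::uint32_t kAttrReparsePoint = 0x00000400;

// dwReserved0 only carries a reparse tag when the reparse attribute is set.
std::optional<std::uint32_t> reparse_tag_from_find_data(std::uint32_t attributes,
                                                        std::uint32_t reserved0) {
    if ((attributes & kAttrReparsePoint) == 0) return std::nullopt;
    return reserved0;
}

#ifdef _WIN32
// Returns the tag for `path`, or nullopt if it is not a reparse point
// (or if the lookup failed; a real implementation would distinguish the two).
std::optional<std::uint32_t> reparse_tag_of(const std::wstring& path) {
    WIN32_FIND_DATAW data;
    HANDLE h = FindFirstFileW(path.c_str(), &data);
    if (h == INVALID_HANDLE_VALUE) return std::nullopt;
    FindClose(h);
    return reparse_tag_from_find_data(data.dwFileAttributes, data.dwReserved0);
}
#endif
```

The point of going through the find data is that no per-file handle has to be opened to learn the tag.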

Granted, a single NtQueryDirectoryFile on the whole directory is not
enough to get both sets of data, but you should still be able to do it
in just two calls per directory (times however many calls are required
to fully enumerate the directory, of course).

Presumably you're currently using one of FileBothDirectoryInformation or
FileFullDirectoryInformation. You should be able to switch to the "Id"
variants (FileIdBothDirectoryInformation or
FileIdFullDirectoryInformation) instead (if you're not already using
them). This gives you a FileId for each file, along with the other
information.

After you've enumerated the entire directory, you can go back and get
FileReparsePointInformation for the whole directory, and then match up
the FileId against the FileReference to merge the data and get the
reparse tag for each file.

(I haven't tested this, so I'm not sure if it gives you an empty tag for
files that aren't reparse points, or only lists reparse points. The
latter would be nice, as it would be close to zero overhead for
directories that do not contain reparse points.)

Presumably Win32 FindFirstFile is doing something like this internally,
since it does provide the reparse tag.


I'm not sure if it's current, but
http://blogs.technet.com/b/filecab/archive/2013/02/14/dfsr-reparse-point-support-or-avoiding-schr-246-dinger-s-file.aspx
seems to suggest the following behaviour as reasonable:

- treating IO_REPARSE_TAG_MOUNT_POINT as directory symlinks
- treating IO_REPARSE_TAG_SYMLINK as symlinks
- treating IO_REPARSE_TAG_DEDUP, IO_REPARSE_TAG_SIS, and
IO_REPARSE_TAG_HSM as regular files
- treating any other tag as something to be ignored (in most cases)


There was also a note that you can use IsReparseTagNameSurrogate to
determine if a given reparse point tag is a surrogate (some kind of
link) or not (treat like regular file). That might be the best option,
if it's consistent -- and at least for the official MS tags it seems to
be; MOUNT_POINT and SYMLINK are surrogates and the other types are not.
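
To illustrate how the explicit tag list and the surrogate test line up, here is a small sketch. The tag values are copied by hand from the Windows SDK headers and is_name_surrogate mirrors the bit test behind the IsReparseTagNameSurrogate macro. The enum and function names are invented, and 0x00000042 in the usage note is a made-up third-party tag.

```cpp
#include <cstdint>

// Values mirror the Windows SDK definitions of the IO_REPARSE_TAG_* constants.
constexpr std::uint32_t kTagMountPoint = 0xA0000003;  // IO_REPARSE_TAG_MOUNT_POINT
constexpr std::uint32_t kTagSymlink    = 0xA000000C;  // IO_REPARSE_TAG_SYMLINK
constexpr std::uint32_t kTagDedup      = 0x80000013;  // IO_REPARSE_TAG_DEDUP
constexpr std::uint32_t kTagSis        = 0x80000007;  // IO_REPARSE_TAG_SIS
constexpr std::uint32_t kTagHsm        = 0xC0000004;  // IO_REPARSE_TAG_HSM

enum class Treatment { directory_symlink, symlink, regular_file, ignore };

// Explicit table, following the list above.
Treatment treatment_by_table(std::uint32_t tag) {
    switch (tag) {
    case kTagMountPoint: return Treatment::directory_symlink;
    case kTagSymlink:    return Treatment::symlink;
    case kTagDedup:
    case kTagSis:
    case kTagHsm:        return Treatment::regular_file;
    default:             return Treatment::ignore;
    }
}

// Same bit test as the IsReparseTagNameSurrogate macro.
constexpr bool is_name_surrogate(std::uint32_t tag) {
    return (tag & 0x20000000u) != 0;
}

// True when the table classifies the tag as some kind of link.
bool table_says_link(std::uint32_t tag) {
    const Treatment t = treatment_by_table(tag);
    return t == Treatment::directory_symlink || t == Treatment::symlink;
}
```

For the five tags listed, table_says_link() and is_name_surrogate() agree. They differ for unknown tags: the table ignores a tag such as 0x00000042, whereas the bit test alone would treat it as a non-link.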

> I appreciate you're saying the cost is worth it, but we're thinking
> all Boost users here, not just the small minority on Windows Server
> 2012 with dedup turned on.

I'm not on Server 2012, but this thread caught my attention because I
remember encountering a bug that prevented all WinXP clients from
accessing deduped files on CIFS shares provided by Server 2012. I think
in the end this was a server-side bug related to McAfee and the
different protocols used by WinXP vs. Win7, and so clients shouldn't
normally be able to see whether files are deduped or not remotely, but I
haven't explicitly verified that. If CIFS shares do expose files as
dedup reparse points instead of concealing that then it might affect
quite a lot of users.



Niall Douglas
2015-07-29 10:59:47 UTC
Permalink
On 29 Jul 2015 at 18:09, Gavin Lambert wrote:

> On 29/07/2015 14:06, Niall Douglas wrote:
> > NTFS compressed files act exactly like normal files. Reparse point
> > files do not and require significant additional processing to figure
> > out what kind they are. That's the difference.
> >
> > From AFIO's perspective, when it does NtQueryDirectoryFile() to fetch
> > metadata about a file entry, it can zero cost learn if an entry is a
> > reparse point by examining FileAttributes for the
> > FILE_ATTRIBUTE_REPARSE_POINT flag. It cannot tell what kind of
> > reparse point file it is without opening the file and asking.
> >
> > Windows' CreateFile() API is astonishingly slow. To require calling
> > that, then an additional NtQueryDirectoryFile() to fetch the
> > FILE_REPARSE_POINT_INFORMATION metadata and close the handle - which
> > is the fastest way I know of to fetch the reparse point tag code -
> > would impose an enormous performance penalty for all file entries
> > marked with FILE_ATTRIBUTE_REPARSE_POINT.
>
> If it helps,
> https://msdn.microsoft.com/en-us/library/windows/desktop/aa365511.aspx
> seems to specify that reparse points provide their tag id in the
> dwReserved0 field of the WIN32_FIND_DATA structure (I'm not sure how
> that maps to the native API, but I assume it's somewhere). That should
> be sufficient to identify the reparse point type.

That does help greatly in fact. I know FindXXXFile doesn't open each
file, so somehow or other the Win32 layer is able to fetch the
reparse tag type for directory entries purely from the directory
handle.

> Granted, a single NtQueryDirectoryFile on the whole directory is not
> enough to get both sets of data, but you should still be able to do it
> in just two calls per directory (times however many calls are required
> to fully enumerate the directory, of course).
>
> Presumably you're currently using one of FileBothDirectoryInformation or
> FileFullDirectoryInformation. You should be able to switch to the "Id"
> variants (FileIdBothDirectoryInformation or
> FileIdFullDirectoryInformation) instead (if you're not already using
> them). This gives you a FileId for each file, along with the other
> information.
>
> After you've enumerated the entire directory, you can go back and get
> FileReparsePointInformation for the whole directory, and then match up
> the FileId against the FileReference to merge the data and get the
> reparse tag for each file.
>
> (I haven't tested this, so I'm not sure if it gives you an empty tag for
> files that aren't reparse points, or only lists reparse points. The
> latter would be nice, as it would be close to zero overhead for
> directories that do not contain reparse points.)

Unfortunately getting FileReparsePointInformation returns just a
single record which is the reparse point for the directory handle
being enumerated. It doesn't return reparse tags for directory
contents.

There is an index of all reparse points on an NTFS volume in a magic
NTFS file stream, but that's NTFS specific code, and it requires a
file handle to be opened.

I'm thinking that as reparse points are really just an overload on
EA, maybe the returned EaSize field is magically set to the reparse
tag when attributes specify it's a reparse point file? I'd have to
experiment to find out. I can't see any other obvious field which
would return the reparse tag.

EDIT: What a guess I just made!:
https://www.osronline.com/showthread.cfm?link=171655. Thanks Gavin,
you just solved the problem for AFIO at least.

> > I appreciate you're saying the cost is worth it, but we're thinking
> > all Boost users here, not just the small minority on Windows Server
> > 2012 with dedup turned on.
>
> I'm not on Server 2012, but this thread caught my attention because I
> remember encountering a bug that prevented all WinXP clients from
> accessing deduped files on CIFS shares provided by Server 2012. I think
> in the end this was a server-side bug related to McAfee and the
> different protocols used by WinXP vs. Win7, and so clients shouldn't
> normally be able to see whether files are deduped or not remotely, but I
> haven't explicitly verified that. If CIFS shares do expose files as
> dedup reparse points instead of concealing that then it might affect
> quite a lot of users.

I had understood from the OP that CIFS is exporting the reparse point
tag to clients, hence the breakage.

The reason, I suspect, that CIFS is being so braindead here is that
opening a deduped file is more expensive than usual and clients ought
to know. Which is exactly why I am opposed to treating these things
as a regular file.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Gavin Lambert
2015-07-29 23:17:26 UTC
Permalink
On 29/07/2015 22:59, Niall Douglas quoth:
> Unfortunately getting FileReparsePointInformation returns just a
> single record which is the reparse point for the directory handle
> being enumerated. It doesn't return reparse tags for directory
> contents.

Ah, true. I missed that part. Seems kinda annoying they made that
different from all the other information classes.

> I'm thinking that as reparse points are really just an overload on
> EA, maybe the returned EaSize field is magically set to the reparse
> tag when attributes specify it's a reparse point file? I'd have to
> experiment to find out. I can't see any other obvious field which
> would return the reparse tag.
>
> EDIT: What a guess I just made!:
> https://www.osronline.com/showthread.cfm?link=171655. Thanks Gavin,
> you just solved the problem for AFIO at least.

Yep, it appears so. That makes life easier.

You should probably make it generic via IsReparseTagNameSurrogate as I
mentioned earlier rather than checking for the symlink/dedup tags
specifically. So:

1. entries with FILE_ATTRIBUTE_DIRECTORY are directories.
2. entries with FILE_ATTRIBUTE_REPARSE_POINT *and* a tag with
IsReparseTagNameSurrogate == true are symlinks. (And possibly also
directories, via #1.)
3. entries with FILE_ATTRIBUTE_REPARSE_POINT *and* a tag with
IsReparseTagNameSurrogate == false are regular files that are possibly
slow to open.
4. entries with FILE_ATTRIBUTE_COMPRESSED are regular files that are
possibly slow to open.
5. entries with FILE_ATTRIBUTE_OFFLINE are regular files that are
probably not openable (or *very* slow to open).
6. entries lacking those attributes are regular files.

Sound about right?
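
The six rules above could be sketched roughly as follows. This is only an illustration under stated assumptions: entry_class, classify and is_regular_file are hypothetical names, not Boost.Filesystem or AFIO API, and the attribute bits and the name-surrogate mask are copied from the Windows SDK so that the sketch compiles on any platform.

```cpp
#include <cstdint>

// Attribute bits and the name-surrogate mask, copied from the Windows SDK
// headers so the sketch compiles without <windows.h>.
constexpr std::uint32_t kAttrDirectory    = 0x00000010u;  // FILE_ATTRIBUTE_DIRECTORY
constexpr std::uint32_t kAttrReparsePoint = 0x00000400u;  // FILE_ATTRIBUTE_REPARSE_POINT
constexpr std::uint32_t kAttrCompressed   = 0x00000800u;  // FILE_ATTRIBUTE_COMPRESSED
constexpr std::uint32_t kAttrOffline      = 0x00001000u;  // FILE_ATTRIBUTE_OFFLINE
constexpr std::uint32_t kNameSurrogateBit = 0x20000000u;  // bit tested by IsReparseTagNameSurrogate

// Hypothetical result type for this sketch.
struct entry_class {
    bool is_directory;      // rule 1
    bool is_symlink;        // rule 2
    bool maybe_slow;        // rules 3 and 4
    bool maybe_unopenable;  // rule 5
};

// `tag` is only meaningful when the reparse-point attribute is set.
inline entry_class classify(std::uint32_t attrs, std::uint32_t tag) {
    entry_class c = {false, false, false, false};
    const bool reparse = (attrs & kAttrReparsePoint) != 0;
    c.is_directory = (attrs & kAttrDirectory) != 0;
    c.is_symlink = reparse && (tag & kNameSurrogateBit) != 0;
    c.maybe_slow = (reparse && !c.is_symlink) || (attrs & kAttrCompressed) != 0;
    c.maybe_unopenable = (attrs & kAttrOffline) != 0;
    return c;
}

// Rule 6: anything that is neither a directory nor a symlink is a regular file.
inline bool is_regular_file(const entry_class &c) {
    return !c.is_directory && !c.is_symlink;
}
```

With this shape a dedup'd file (reparse point, non-surrogate tag) comes out as a regular file with maybe_slow set, while a junction comes out as both a symlink and a directory.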

> The reason, I suspect, that CIFS is being so braindead here is that
> opening a deduped file is more expensive than usual and clients ought
> to know. Which is exactly why I am opposed to treating these things
> as a regular file.

I think they probably should be treated the same as files with
FILE_ATTRIBUTE_COMPRESSED, since essentially it's just a different
compression scheme. I don't know whether you currently distinguish
these from regular files or not.



_______________________________________________
Unsubscribe & other changes: http://lists.boost.org/mailman/listinfo.cgi/boost
Niall Douglas
2015-07-31 22:39:39 UTC
On 30 Jul 2015 at 11:17, Gavin Lambert wrote:

> You should probably make it generic via IsReparseTagNameSurrogate as I
> mentioned earlier rather than checking for the symlink/dedup tags
> specifically.

Actually no - AFIO can only read the target for reparse points it
knows about.

I committed a fix for this earlier today. It reports reparse points
with tag IO_REPARSE_TAG_MOUNT_POINT or IO_REPARSE_TAG_SYMLINK as
symlinks. Everything else is reported normally.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Gavin Lambert
2015-08-02 23:28:03 UTC
On 1/08/2015 10:39, Niall Douglas wrote:
> On 30 Jul 2015 at 11:17, Gavin Lambert wrote:
>
>> You should probably make it generic via IsReparseTagNameSurrogate as I
>> mentioned earlier rather than checking for the symlink/dedup tags
>> specifically.
>
> Actually no - AFIO can only read the target for reparse points it
> knows about.
>
> I committed a fix for this earlier today. It reports reparse points
> with tag IO_REPARSE_TAG_MOUNT_POINT or IO_REPARSE_TAG_SYMLINK as
> symlinks. Everything else is reported normally.

Granted I'm not familiar with AFIO's APIs, but wouldn't it make the most
sense to report other name surrogates as symlinks as well (via an "is
this a symlink" or "get file type" method), but then if queried for the
target of an unknown symlink type it will return/throw a "not supported"
error?



Niall Douglas
2015-08-03 01:43:41 UTC
On 3 Aug 2015 at 11:28, Gavin Lambert wrote:

> > Actually no - AFIO can only read the target for reparse points it
> > knows about.
> >
> > I committed a fix for this earlier today. It reports reparse points
> > with tag IO_REPARSE_TAG_MOUNT_POINT or IO_REPARSE_TAG_SYMLINK as
> > symlinks. Everything else is reported normally.
>
> Granted I'm not familiar with AFIO's APIs,

There is a single page "cheat sheet" at
https://boostgsoc13.github.io/boost.afio/doc/html/afio/overview.html.

> but wouldn't it make the most
> sense to report other name surrogates as symlinks as well (via an "is
> this a symlink" or "get file type" method), but then if queried for the
> target of an unknown symlink type it will return/throw a "not supported"
> error?

I am not averse to adding a "st_reparse_point" field to stat_t. This
would let client code do its own detection on Windows. Does this work
for you?

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Gavin Lambert
2015-08-03 04:38:10 UTC
On 3/08/2015 13:43, Niall Douglas wrote:
> On 3 Aug 2015 at 11:28, Gavin Lambert wrote:
>
>>> Actually no - AFIO can only read the target for reparse points it
>>> knows about.
>>>
>>> I committed a fix for this earlier today. It reports reparse points
>>> with tag IO_REPARSE_TAG_MOUNT_POINT or IO_REPARSE_TAG_SYMLINK as
>>> symlinks. Everything else is reported normally.
>>
>> Granted I'm not familiar with AFIO's APIs,
>
> There is a single page "cheat sheet" at
> https://boostgsoc13.github.io/boost.afio/doc/html/afio/overview.html.

It would be nice if this included hyperlinks for the local types. I
have no idea what a directory_entry looks like.

(And even after manually navigating around the docs until I found
https://boostgsoc13.github.io/boost.afio/doc/html/afio/reference/classes/directory_entry.html,
I still have no idea what those fields actually *mean*. Only because
you mentioned it below did I also find
https://boostgsoc13.github.io/boost.afio/doc/html/afio/reference/structs/stat_t.html,
which is more descriptive. Although I later went back and noticed I
overlooked fetch_lstat on directory_entry. Another case where
hyperlinks would have been nice.)

>> but wouldn't it make the most
>> sense to report other name surrogates as symlinks as well (via an "is
>> this a symlink" or "get file type" method), but then if queried for the
>> target of an unknown symlink type it will return/throw a "not supported"
>> error?

Using the above vocabulary, it seems to me that:

- enumerate() / lstat() should be able to report all name surrogates
as symlinks, however that is currently done (presumably via st_type ==
symlink_file). Other reparse types should be reported as regular
files/directories.

- symlink() should be able to open unknown symlinks (since that's
just a flag to CreateFile).

- rmsymlink() should be able to delete unknown symlinks.

- target() should work for the known symlink types and fail "not
supported" (or similar) for the other name surrogate types, and fail
"invalid operation" (or similar) for any non-reparse file or
non-name-surrogate type.

Does that sound reasonable?
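
To make the target() rule above concrete, here is a minimal sketch of just the error selection. It is not AFIO's implementation: target_query_status is a hypothetical helper, the mapping of "not supported" and "invalid operation" onto std::errc values is my assumption, and the tag constants are copied from the Windows SDK so the sketch builds anywhere.

```cpp
#include <cstdint>
#include <system_error>

constexpr std::uint32_t kTagMountPoint    = 0xA0000003u;  // IO_REPARSE_TAG_MOUNT_POINT
constexpr std::uint32_t kTagSymlink       = 0xA000000Cu;  // IO_REPARSE_TAG_SYMLINK
constexpr std::uint32_t kNameSurrogateBit = 0x20000000u;  // bit tested by IsReparseTagNameSurrogate

// Hypothetical helper: returns the error a target() query would report,
// or a default-constructed (success) error_code if the target is readable.
inline std::error_code target_query_status(bool is_reparse_point,
                                           std::uint32_t tag) {
    // Not a reparse point, or a reparse point that is not a name surrogate:
    // asking for a link target makes no sense ("invalid operation").
    if (!is_reparse_point || (tag & kNameSurrogateBit) == 0)
        return std::make_error_code(std::errc::invalid_argument);
    // A name surrogate of a kind we cannot parse ("not supported").
    if (tag != kTagMountPoint && tag != kTagSymlink)
        return std::make_error_code(std::errc::not_supported);
    return std::error_code();
}
```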

I suppose another variant on this would be to report known-type symlinks
as st_type == symlink_file, unknown-type name surrogates as st_type ==
type_unknown, and any other reparse point as st_type ==
regular_file/directory_file. This would have the advantage of hinting
whether target() is likely to work, but the disadvantage of being a bit
misleading.

(On a peripherally related note, it seems odd that Boost.Filesystem's
file_type appears to lack a way to express "a symlink to a directory",
which should be opened as a directory instead of as a file. Is this a
POSIX limitation, that you're required to inspect the target to
determine whether it's a file or directory? I know that Windows
provides this up-front, both for junctions and for actual symlinks,
which in turn means that if you do want to follow directory symlinks
then you can just open them as regular directories without fanfare. Of
course, that's also partly why symlinks are discouraged on Windows,
because naive enumeration code will follow them by default and hilarity
can ensue.)

> I am not averse to adding a "st_reparse_point" field to stat_t. This
> would let client code do its own detection on Windows. Does this work
> for you?

I don't personally have a use case, so I can't really answer the last
question. As I said I'm coming at this thread from a design standpoint
rather than a practical one. (And the original focus of the thread was
on Filesystem rather than AFIO.)

Having said that, more information never hurts; but I think this should
be in addition to the behaviour described above, not instead of it.



Niall Douglas
2015-08-03 16:31:27 UTC
On 3 Aug 2015 at 16:38, Gavin Lambert wrote:

> > There is a single page "cheat sheet" at
> > https://boostgsoc13.github.io/boost.afio/doc/html/afio/overview.html.
>
> It would be nice if this included hyperlinks for the local types. I
> have no idea what a directory_entry looks like.

Fixed. Each operation on the cheat sheet now lists what types are
related to it.

> (And even after manually navigating around the docs until I found
> https://boostgsoc13.github.io/boost.afio/doc/html/afio/reference/classes/directory_entry.html,
> I still have no idea what those fields actually *mean*. Only because
> you mentioned it below did I also find
> https://boostgsoc13.github.io/boost.afio/doc/html/afio/reference/structs/stat_t.html,
> which is more descriptive. Although I later went back and noticed I
> overlooked fetch_lstat on directory_entry. Another case where
> hyperlinks would have been nice.)

Fixed. Each reference page now links to related types too.

> >> but wouldn't it make the most
> >> sense to report other name surrogates as symlinks as well (via an "is
> >> this a symlink" or "get file type" method), but then if queried for the
> >> target of an unknown symlink type it will return/throw a "not supported"
> >> error?
>
> Using the above vocabulary, it seems to me that:
>
> - enumerate() / lstat() should be able to report all name surrogates
> as symlinks, however that is currently done (presumably via st_type ==
> symlink_file). Other reparse types should be reported as regular
> files/directories.

I would prefer to not report something as a symlink when target()
won't work with it. So you now have an additional stat_t flag called
st_reparse_point which is always the FILE_ATTRIBUTE_REPARSE_POINT
flag.

> - symlink() should be able to open unknown symlinks (since that's
> just a flag to CreateFile).

This works.

> - rmsymlink() should be able to delete unknown symlinks.

This works.

> - target() should work for the known symlink types and fail "not
> supported" (or similar) for the other name surrogate types, and fail
> "invalid operation" (or similar) for any non-reparse file or
> non-name-surrogate type.

This works. Unknown symlink types return an EINVAL error as per
POSIX.

> Does that sound reasonable?

Yes :)

> I suppose another variant on this would be to report known-type symlinks
> as st_type == symlink_file, unknown-type name surrogates as st_type ==
> type_unknown, and any other reparse point as st_type ==
> regular_file/directory_file. This would have the advantage of hinting
> whether target() is likely to work, but the disadvantage of being a bit
> misleading.
>
> (On a peripherally related note, it seems odd that Boost.Filesystem's
> file_type appears to lack a way to express "a symlink to a directory",
> which should be opened as a directory instead of as a file. Is this a
> POSIX limitation, that you're required to inspect the target to
> determine whether it's a file or directory? I know that Windows
> provides this up-front, both for junctions and for actual symlinks,
> which in turn means that if you do want to follow directory symlinks
> then you can just open them as regular directories without fanfare. Of
> course, that's also partly why symlinks are discouraged on Windows,
> because naive enumeration code will follow them by default and hilarity
> can ensue.)

Filesystem is trapped by POSIX however, and POSIX treats symlinks as
a special thing unto themselves.

AFIO is a bit caught here too actually. If you're enumerating a
directory you have no easy way of disambiguating between a symlink to
a directory and a symlink to a file. You basically have to try
opening it as a directory, and if it errors out you then open it as a
file.

Windows does supply what kind of symlink it is without additional
syscalls, but POSIX does not. You'd have to do two syscalls per entry
to disambiguate which is very costly for something so niche.

> > I am not averse to adding a "st_reparse_point" field to stat_t. This
> > would let client code do its own detection on Windows. Does this work
> > for you?
>
> I don't personally have a use case, so I can't really answer the last
> question. As I said I'm coming at this thread from a design standpoint
> rather than a practical one. (And the original focus of the thread was
> on Filesystem rather than AFIO.)
>
> Having said that, more information never hurts; but I think this should
> be in addition to the behaviour described above, not instead of it.

Well you've got a st_reparse_point field now, so you can detect
reparse points which aren't those understood by AFIO and special case
them if you so desire.

The key aim for AFIO is POSIX filesystem semantics that are as
consistent as is portably possible. As mentioned in earlier threads,
any real world use of async file i/o is going to need #ifdef for
platforms anyway as filing systems are so different, but where I can
eliminate that I will.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Gavin Lambert
2015-08-04 00:40:48 UTC
On 4/08/2015 04:31, Niall Douglas wrote:
> On 3 Aug 2015 at 16:38, Gavin Lambert wrote:
>
>>> There is a single page "cheat sheet" at
>>> https://boostgsoc13.github.io/boost.afio/doc/html/afio/overview.html.
>>
>> It would be nice if this included hyperlinks for the local types. I
>> have no idea what a directory_entry looks like.
>
> Fixed. Each operation on the cheat sheet now lists what types are
> related to it.

Nice, thanks.

>> (And even after manually navigating around the docs until I found
>> https://boostgsoc13.github.io/boost.afio/doc/html/afio/reference/classes/directory_entry.html,
>> I still have no idea what those fields actually *mean*. Only because
>> you mentioned it below did I also find
>> https://boostgsoc13.github.io/boost.afio/doc/html/afio/reference/structs/stat_t.html,
>> which is more descriptive. Although I later went back and noticed I
>> overlooked fetch_lstat on directory_entry. Another case where
>> hyperlinks would have been nice.)
>
> Fixed. Each reference page now links to related types too.

They still seem to be missing on the function reference pages (I checked
https://boostgsoc13.github.io/boost.afio/doc/html/afio/reference/functions/enumerate/enumerate_6_max_items_first_throwing.html),
which is probably where they'd be the most useful.

I'm used to the style of the Boost.Asio docs (eg.
http://www.boost.org/doc/libs/1_58_0/doc/html/boost_asio/reference/async_read/overload1.html),
where all the types are linked directly in the method description. (Of
course, mostly they're templates, but still...)

> I would prefer to not report something as a symlink when target()
> won't work with it. So you now have an additional stat_t flag called
> st_reparse_point which is always the FILE_ATTRIBUTE_REPARSE_POINT
> flag.

I guess that depends on usage cases -- if it's most common to write code
like if (type() == symlink_file) { do something with target(); } then
you have a point. Although code that has sufficient error checking
should be able to cope with the idea of a symlink that has an unreadable
target.

But it seems odd to me to claim that a file is *not* a symlink just
because you're told that it's a type of symlink that you don't know how
to read.

Having said that, I don't know how common custom symlinks are in the
wild, or if they even exist at all.

> AFIO is a bit caught here too actually. If you're enumerating a
> directory you have no easy way of disambiguating between a symlink to
> a directory and a symlink to a file. You basically have to try
> opening it as a directory, and if it errors out you then open it as a
> file.
>
> Windows does supply what kind of symlink it is without additional
> syscalls, but POSIX does not. You'd have to do two syscalls per entry
> to disambiguate which is very costly for something so niche.

Perhaps rather than just having symlink_file, Filesystem should have
symlink_file, symlink_directory, and symlink_entry? POSIX would return
the latter (indicating that it's unknown whether it's a file or
directory) while Windows would return one of the first two. This would
still allow code to be written in a reasonably platform-independent manner.
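
As a rough sketch of that idea (purely hypothetical: none of these enumerators or functions exist in Boost.Filesystem, and the names are only the ones floated above):

```cpp
#include <cstdint>

constexpr std::uint32_t kAttrDirectory = 0x00000010u;  // FILE_ATTRIBUTE_DIRECTORY

// Hypothetical enumerators following the names proposed above.
enum class symlink_kind {
    symlink_file,       // link known to point at a file
    symlink_directory,  // link known to point at a directory
    symlink_entry       // link whose target kind is unknown
};

// Windows reports the directory attribute on the link itself, so the kind
// is known straight from the enumeration data.
inline symlink_kind kind_from_windows_attributes(std::uint32_t attrs) {
    return (attrs & kAttrDirectory) != 0 ? symlink_kind::symlink_directory
                                         : symlink_kind::symlink_file;
}

// POSIX enumeration cannot tell without following the link.
inline symlink_kind kind_from_posix_enumeration() {
    return symlink_kind::symlink_entry;
}

// Portable callers only pay for an extra lookup when the kind is unknown.
inline bool needs_target_lookup(symlink_kind k) {
    return k == symlink_kind::symlink_entry;
}
```

Code that only cares whether something is a symlink can treat all three values alike; code that wants to recurse can branch on needs_target_lookup().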

Another option might be for stat_t to have a field that contains the
OS-native flags, so that on Windows the DIRECTORY flag could be examined
directly. This might also allow for other esoteric attributes
(COMPRESSED, ENCRYPTED, NOT_CONTENT_INDEXED, etc) to be inspected/set as
desired, although that's probably more useful in Filesystem rather than
AFIO. Although I think this is uglier than the above for the
enumeration case.



Niall Douglas
2015-08-04 22:14:48 UTC
On 4 Aug 2015 at 12:40, Gavin Lambert wrote:

> > Fixed. Each reference page now links to related types too.
>
> They still seem to be missing on the function reference pages (I checked
> https://boostgsoc13.github.io/boost.afio/doc/html/afio/reference/functions/enumerate/enumerate_6_max_items_first_throwing.html),
> which is probably where they'd be the most useful.

They're actually there; it's just that the docs generation tooling
has collapsed the paragraphs into a single large paragraph. I'll look
into a workaround after I've ported AFIO onto the new APIBind based
multi-abi implementation of Boost.Monad.

> I'm used to the style of the Boost.Asio docs (eg.
> http://www.boost.org/doc/libs/1_58_0/doc/html/boost_asio/reference/async_read/overload1.html),
> where all the types are linked directly in the method description. (Of
> course, mostly they're templates, but still...)

I don't think the Boost.Geometry doxygen to qbk tool AFIO uses can do
this.

For Boost.Monad I'm sticking to a pure doxygen solution. I've wasted
a lot of blood and sweat for little gain on AFIO's BoostBook
documentation, and doxygen I think is a much more complete
documentation tool than it once used to be.

It would be really great if someone could skin doxygen to output
something very close to BoostBook's output as I find doxygen's
default HTML output pretty awful, but I suspect many of us will need
to adopt doxygen first to generate the pressure for someone to do the
skinning work.

> > I would prefer to not report something as a symlink when target()
> > won't work with it. So you now have an additional stat_t flag called
> > st_reparse_point which is always the FILE_ATTRIBUTE_REPARSE_POINT
> > flag.
>
> I guess that depends on usage cases -- if it's most common to write code
> like if (type() == symlink_file) { do something with target(); } then
> you have a point. Although code that has sufficient error checking
> should be able to cope with the idea of a symlink that has an unreadable
> target.
>
> But it seems odd to me to claim that a file is *not* a symlink just
> because you're told that it's a type of symlink that you don't know how
> to read.

I'd like to think AFIO's symlinks are "POSIX(-y) symlinks".

> Having said that, I don't know how common custom symlinks are in the
> wild, or if they even exist at all.
>
> > AFIO is a bit caught here too actually. If you're enumerating a
> > directory you have no easy way of disambiguating between a symlink to
> > a directory and a symlink to a file. You basically have to try
> > opening it as a directory, and if it errors out you then open it as a
> > file.
> >
> > Windows does supply what kind of symlink it is without additional
> > syscalls, but POSIX does not. You'd have to do two syscalls per entry
> > to disambiguate which is very costly for something so niche.
>
> Perhaps rather than just having symlink_file, Filesystem should have
> symlink_file, symlink_directory, and symlink_entry? POSIX would return
> the latter (indicating that it's unknown whether it's a file or
> directory) while Windows would return one of the first two. This would
> still allow code to be written in a reasonably platform-independent manner.
>
> Another option might be for stat_t to have a field that contains the
> OS-native flags, so that on Windows the DIRECTORY flag could be examined
> directly. This might also allow for other esoteric attributes
> (COMPRESSED, ENCRYPTED, NOT_CONTENT_INDEXED, etc) to be inspected/set as
> desired, although that's probably more useful in Filesystem rather than
> AFIO. Although I think this is uglier than the above for the
> enumeration case.

Given the Filesystem TS has shipped, I'd say that moment has passed.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Gavin Lambert
2015-08-04 23:27:49 UTC
On 5/08/2015 10:14, Niall Douglas wrote:
>> I guess that depends on usage cases -- if it's most common to write code
>> like if (type() == symlink_file) { do something with target(); } then
>> you have a point. Although code that has sufficient error checking
>> should be able to cope with the idea of a symlink that has an unreadable
>> target.
>>
>> But it seems odd to me to claim that a file is *not* a symlink just
>> because you're told that it's a type of symlink that you don't know how
>> to read.
>
> I'd like to think AFIO's symlinks are "POSIX(-y) symlinks".

That's least-common-denominator thinking. Which is hard to get away
from when building a cross-platform abstraction layer, I know, but
"because it's POSIX" isn't really a good justification either. There
are some things that POSIX is very bad at (mostly for historic reasons).

If you have a function that operates on symlinks, then it should operate
on *all* symlinks, not merely a subset of them.

Like I said though, it's possible the distinction is academic and not
practical; I don't know if there are any other kinds of surrogates in
the wild. So I can understand the reluctance. :)

> Given the Filesystem TS has shipped, I'd say that moment has passed.

Too late to be in the standard (yet), maybe. But one of the roles of
Boost is to be better than the standard, so it can be the *next*
standard. :)



Niall Douglas
2015-08-05 17:03:57 UTC
On 5 Aug 2015 at 11:27, Gavin Lambert wrote:

> >> But it seems odd to me to claim that a file is *not* a symlink just
> >> because you're told that it's a type of symlink that you don't know how
> >> to read.
> >
> > I'd like to think AFIO's symlinks are "POSIX(-y) symlinks".
>
> That's least-common-denominator thinking. Which is hard to get away
> from when building a cross-platform abstraction layer, I know, but
> "because it's POSIX" isn't really a good justification either. There
> are some things that POSIX is very bad at (mostly for historic reasons).

I'm more thinking that there is no point in adding features which
have no proven use case yet.

I don't mind adding a boolean flag which costs me nothing and cannot
be buggy. I get much more worried about adding features which I
cannot test and have no proven user base. Better I think to wait
until a proven use case arises.

BTW you may not be aware, but AFIO includes every historical release
of itself within itself via submodule branch pins. In other words, if
you build an application targeting v1 ABI of AFIO, that will work in
perpetuity (or at least until I stop supporting it). AFIO already
ships two versions of itself, v1 and v2.

Hence I don't have the problems other Boost libraries have with
changing API semantics down the line. I can do so without breaking
anyone's code because a literal copy of each previous AFIO version
is shipped with every release.

> > Given the Filesystem TS has shipped, I'd say that moment has passed.
>
> Too late to be in the standard (yet), maybe. But one of the roles of
> Boost is to be better than the standard, so it can be the *next*
> standard. :)

If you can persuade Beman I'll follow it. AFIO is intended as a set
of extensions to Filesystem, not as a replacement and as such is
wholly dependent on Filesystem. In other words, whatever Filesystem
does I'll match.

Niall

--
ned Productions Limited Consulting
http://www.nedproductions.biz/
http://ie.linkedin.com/in/nialldouglas/
Gavin Lambert
2015-08-06 00:00:47 UTC
On 6/08/2015 05:03, Niall Douglas wrote:
>>> Given the Filesystem TS has shipped, I'd say that moment has passed.
>>
>> Too late to be in the standard (yet), maybe. But one of the roles of
>> Boost is to be better than the standard, so it can be the *next*
>> standard. :)
>
> If you can persuade Beman I'll follow it. AFIO is intended as a set
> of extensions to Filesystem, not as a replacement and as such is
> wholly dependent on Filesystem. In other words, whatever Filesystem
> does I'll match.

That was the idea. I guess we'll have to wait and see what he says next
week.



Paul Harris
2015-07-30 04:05:38 UTC
On 29 July 2015 at 18:59, Niall Douglas <***@nedprod.com> wrote:

> On 29 Jul 2015 at 18:09, Gavin Lambert wrote:
>
> > On 29/07/2015 14:06, Niall Douglas wrote:
>
> > > I appreciate you're saying the cost is worth it, but we're thinking
> > > all Boost users here, not just the small minority on Windows Server
> > > 2012 with dedup turned on.
> >
> > I'm not on Server 2012, but this thread caught my attention because I
> > remember encountering a bug that prevented all WinXP clients from
> > accessing deduped files on CIFS shares provided by Server 2012. I think
> > in the end this was a server-side bug related to McAfee and the
> > different protocols used by WinXP vs. Win7, and so clients shouldn't
> > normally be able to see whether files are deduped or not remotely, but I
> > haven't explicitly verified that. If CIFS shares do expose files as
> > dedup reparse points instead of concealing that then it might affect
> > quite a lot of users.
>
> I had understood from the OP that CIFS is exporting the reparse point
> tag to clients, hence the breakage.
>
> The reason, I suspect, that CIFS is being so braindead here is that
> opening a deduped file is more expensive than usual and clients ought
> to know. Which is exactly why I am opposed to treating these things
> as a regular file.
>
>
On the topic of "this file will be slow to read", IMHO this is an
orthogonal issue.
It might be nice to be able to query some sort of "this will be hell slow
to read" status so I could perhaps do something about it,
but the files (slow or not) should still be treated as normal files. This
problem is bigger than just reparse-point files.

Reparse-point files (not symlinks/junctions) are just one type of
maybe-this-will-be-slow files.

Reading off an underutilised local disk is a lot faster than reading off
a local disk suffering high IO.

On Monday, "K:" might be a lot slower than "M:" because the K: drive is
a distant server on a slow network, and M: is a fast server on the
local subnet.
On Tuesday, it is perhaps the opposite because I've flown into the site
hosting K:.

Perhaps a network file read is slow one minute (on 3G network) and fast
just one minute later (switch on WIFI).

But the current system doesn't tell me anything about that. Nor does
Boost treat the K: files as "special" files just because they *might* be
slow. So I don't see why we should start treating (e.g. dedup) reparse
files any differently.

Speed of a read is an orthogonal issue, and often not something that I can
do anything about.
If it's going to take 5 minutes to read that Word document off the disk,
then that's what it takes. I can't read that file any other way.
If it's a problem for my software, I'll need to read it in a nonblocking
way, with the ability to cancel and show progress etc.
But the simple case is to block on the read, and my users are cool with
that because most software works that way.

cheers,
Paul

Paul Harris
2015-07-30 02:49:38 UTC
On 29 July 2015 at 14:09, Gavin Lambert <***@compacsort.com> wrote:

> I'm not sure if it's current, but
> http://blogs.technet.com/b/filecab/archive/2013/02/14/dfsr-reparse-point-support-or-avoiding-schr-246-dinger-s-file.aspx
> seems to suggest the following behaviour as reasonable:
>
> - treating IO_REPARSE_TAG_MOUNT_POINT as directory symlinks
> - treating IO_REPARSE_TAG_SYMLINK as symlinks
> - treating IO_REPARSE_TAG_DEDUP, IO_REPARSE_TAG_SIS, and
> IO_REPARSE_TAG_HSM as regular files
> - treating any other tag as something to be ignored (in most cases)
>
>
I believe the last point is wrong in our context. That blog is talking
about DFS Replication, which is a very special case for reading files. The
fallback ("dehydrating and rehydrating files") is something they'd rather
not do because it would be unpacking files out of 3rd party archival
areas. They'd rather not read and copy content if they can avoid it.

This is so specific that they probably shouldn't be using boost libraries
to do this work.


3rd party companies (like McAfee, Symantec) can request a unique reparse
tag for their custom server software.
When the file is read, the server uses the reparse tag ID to match up with
the required 3rd party driver to handle the read/write.
For example, Symantec Enterprise Vault v10 has Reparse Tag Value 0x00000010
(observed on a server in the wild).
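
To show what can be read off a tag like that, here is a small sketch that splits a reparse tag into the flag bits tested by the Windows SDK macros IsReparseTagMicrosoft and IsReparseTagNameSurrogate. The struct and function names are hypothetical, and the masks are copied from the SDK so the sketch builds without Windows headers.

```cpp
#include <cstdint>

// Hypothetical decoded view of a reparse tag.
struct reparse_tag_info {
    bool microsoft_owned;  // bit 31, tested by IsReparseTagMicrosoft
    bool name_surrogate;   // bit 29, tested by IsReparseTagNameSurrogate
    std::uint16_t value;   // low 16 bits, the tag value proper
};

inline reparse_tag_info decode_reparse_tag(std::uint32_t tag) {
    reparse_tag_info info;
    info.microsoft_owned = (tag & 0x80000000u) != 0;
    info.name_surrogate = (tag & 0x20000000u) != 0;
    info.value = static_cast<std::uint16_t>(tag & 0x0000FFFFu);
    return info;
}
```

Decoding the 0x00000010 value quoted above gives a tag that is neither Microsoft-owned nor a name surrogate, i.e. not a symlink of any kind.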

If you like, I can send a screenshot of this particular file, taken on a
Windows 7 client computer, looking at K: (a network share of a Windows
server).

This allows 3rd party companies to make their own fancy cluster/archival
storage solutions.
The only way I can read this particular file is through the first
filename; there is no symlink to follow. So it's not a symlink: there
is no second filename to look at.
In this particular case, the files are archived if not read for X days.
When you first open the file, Symantec replaces the reparse-point file with
the REAL file, and things continue as normal from there. So, similar
purpose to a dedup file, but different implementation.

So I would have written that last point as:
- treating any other tag as a regular file

Because in this case, the server admins that I talk to want to install
whatever software they like on the server, and for client software to just
read the files.
They use this software to reduce the storage usage. That's all.

cheers,
Paul

Gavin Lambert
2015-07-30 03:33:36 UTC
On 30/07/2015 14:49, Paul Harris wrote:
> On 29 July 2015 at 14:09, Gavin Lambert <***@compacsort.com> wrote:
>
>> I'm not sure if it's current, but
>> http://blogs.technet.com/b/filecab/archive/2013/02/14/dfsr-reparse-point-support-or-avoiding-schr-246-dinger-s-file.aspx
>> seems to suggest the following behaviour as reasonable:
>>
>> - treating IO_REPARSE_TAG_MOUNT_POINT as directory symlinks
>> - treating IO_REPARSE_TAG_SYMLINK as symlinks
>> - treating IO_REPARSE_TAG_DEDUP, IO_REPARSE_TAG_SIS, and
>> IO_REPARSE_TAG_HSM as regular files
>> - treating any other tag as something to be ignored (in most cases)
>>
>>
> I believe the last point is wrong in our context. That blog is talking
> about DFS Replication, which is a very special case for reading files. The
> fallback ("dehydrating and rehydrating files") is something they'd rather
> not do because it would be unpacking files out of 3rd party archival
> areas. They'd rather not read and copy content if they can avoid it.
[...]
> So I would have written that last point as:
> - treating any other tag as a regular file

If you have a look at the very next paragraph in the quoted message,
that's what I said. :)

(The part you quoted was repeating what the blog said, not as a
recommendation for Boost library behaviour.)



Paul Harris
2015-07-30 03:54:20 UTC
On 30 July 2015 at 11:33, Gavin Lambert <***@compacsort.com> wrote:

> On 30/07/2015 14:49, Paul Harris wrote:
>
>> On 29 July 2015 at 14:09, Gavin Lambert <***@compacsort.com> wrote:
>>
>>> I'm not sure if it's current, but
>>> http://blogs.technet.com/b/filecab/archive/2013/02/14/dfsr-reparse-point-support-or-avoiding-schr-246-dinger-s-file.aspx
>>> seems to suggest the following behaviour as reasonable:
>>>
>>> - treating IO_REPARSE_TAG_MOUNT_POINT as directory symlinks
>>> - treating IO_REPARSE_TAG_SYMLINK as symlinks
>>> - treating IO_REPARSE_TAG_DEDUP, IO_REPARSE_TAG_SIS, and
>>> IO_REPARSE_TAG_HSM as regular files
>>> - treating any other tag as something to be ignored (in most cases)
>>>
>> I believe the last point is wrong in our context. That blog is talking
>> about DFS Replication, which is a very special case for reading files. The
>> fallback ("dehydrating and rehydrating files") is something they'd rather
>> not do because it would be unpacking files out of 3rd party archival
>> areas. They'd rather not read and copy content if they can avoid it.
>>
> [...]
>
>> So I would have written that last point as:
>> - treating any other tag as a regular file
>>
>
> If you have a look at the very next paragraph in the quoted message,
> that's what I said. :)
>
> (The part you quoted was repeating what the blog said, not as a
> recommendation for Boost library behaviour.)
>
>
Sorry, you mean this bit:

> There was also a note that you can use IsReparseTagNameSurrogate to
> determine if a given reparse point tag is a surrogate (some kind of link)
> or not (treat like regular file). That might be the best option, if it's
> consistent -- and at least for the official MS tags it seems to be;
> MOUNT_POINT and SYMLINK are surrogates and the other types are not.
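
A minimal sketch of the surrogate test described in that paragraph, for reference. It assumes the macro is a test of bit 29 (mask 0x20000000) and uses tag values as I recall them from the Windows SDK headers; the lowercase constant names, the enum, and the function names are invented here so that the snippet compiles without the Windows headers.

```cpp
#include <cassert>
#include <cstdint>

// Tag values as recalled from the Windows SDK; local names are hypothetical.
constexpr std::uint32_t tag_mount_point = 0xA0000003u;
constexpr std::uint32_t tag_symlink     = 0xA000000Cu;
constexpr std::uint32_t tag_hsm         = 0xC0000004u;
constexpr std::uint32_t tag_sis         = 0x80000007u;
constexpr std::uint32_t tag_dedup       = 0x80000013u;

// Hypothetical result type: either some kind of link, or a plain file.
enum class reparse_kind { link, regular_file };

// Portable stand-in for IsReparseTagNameSurrogate: tests bit 29 of the tag.
constexpr bool is_name_surrogate(std::uint32_t tag) {
    return (tag & 0x20000000u) != 0;
}

// Surrogates are links; every other reparse tag is treated as a regular file.
constexpr reparse_kind classify_reparse_tag(std::uint32_t tag) {
    return is_name_surrogate(tag) ? reparse_kind::link
                                  : reparse_kind::regular_file;
}

inline void check_official_tags() {
    assert(classify_reparse_tag(tag_mount_point) == reparse_kind::link);
    assert(classify_reparse_tag(tag_symlink) == reparse_kind::link);
    assert(classify_reparse_tag(tag_dedup) == reparse_kind::regular_file);
}
```

With this rule a third-party tag such as the 0x00000010 value mentioned earlier in the thread also falls through to "regular file", since its surrogate bit is clear.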

Gavin Lambert
2015-07-30 04:38:31 UTC
On 30/07/2015 15:54, Paul Harris wrote:
>> If you have a look at the very next paragraph in the quoted message,
>> that's what I said. :)
>>
>> (The part you quoted was repeating what the blog said, not as a
>> recommendation for Boost library behaviour.)
>>
> Sorry, you mean this bit:
>
>> There was also a note that you can use IsReparseTagNameSurrogate to
>> determine if a given reparse point tag is a surrogate (some kind of link)
>> or not (treat like regular file). That might be the best option, if it's
>> consistent -- and at least for the official MS tags it seems to be;
>> MOUNT_POINT and SYMLINK are surrogates and the other types are not.

Yes. Admittedly it was perhaps a bit unclear; I expanded on this in my
reply to Niall earlier today, which does have a recommendation, although
not down to specific APIs.

I'm still assuming that he wants to distinguish between "fast files" and
"slow files" in some way, but both should be "regular files" -- there
should be a separate API to ask if they're fast or not, if that
distinction is useful.
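
One possible shape for that separation, as a sketch only: the file type stays "regular" for every non-surrogate reparse point, and a second, independent flag reports whether a read might be slow. The type and function names are hypothetical, and treating every non-surrogate reparse point as potentially slow is an assumption made for the example, not something established in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types for illustration; not an existing Boost.Filesystem API.
enum class file_kind { regular, link };

struct file_class {
    file_kind kind;   // what callers such as copy or directory listing see
    bool maybe_slow;  // separate hint: reading may trigger a recall/rehydration
};

// Classify a file from whether it carries a reparse point and, if so, its tag.
// Bit 29 (0x20000000) is assumed to be the name-surrogate bit.
constexpr file_class classify_file(bool has_reparse_point, std::uint32_t tag) {
    return !has_reparse_point
               ? file_class{file_kind::regular, false}
               : (tag & 0x20000000u) != 0
                     ? file_class{file_kind::link, false}
                     : file_class{file_kind::regular, true};
}

inline void check_fast_and_slow() {
    // Plain file: regular and fast.
    assert(classify_file(false, 0).kind == file_kind::regular);
    assert(!classify_file(false, 0).maybe_slow);
    // Third-party archive stub (tag 0x10): still regular, but flagged slow.
    assert(classify_file(true, 0x00000010u).kind == file_kind::regular);
    assert(classify_file(true, 0x00000010u).maybe_slow);
}
```

Code that only cares about "is this a regular file" never looks at the second field, while a backup or indexing tool could check it before deciding to open the file.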


