Re: File Searching, how to speed it up?



Robert wrote:
I have a win32 C# app that needs to recursively search for a particular file type really fast. It always searches the same place for these files.

I'm using Directory.GetDirectories and Directory.GetFiles to do it currently. Because I know the location is the same every time I build a cache of all the files and directories. If a directory changes I update its contents on startup of the app.

Unfortunately, when I do a last-date-of-modification check on a directory it only reports changes to files that are directly under it. So I have to check every single directory rather than simply checking the parent directory.

I get a significant improvement when none of the directories change (it's much faster to load a file with all the entries in it than to do a directory search). The problem is the first time the app loads up, when it has to search every file (around 100,000 and growing), which is painfully slow (2 or 3 minutes to start up the app).
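For readers following along, the caching scheme described above could be sketched roughly like this. It is only a minimal illustration, not the poster's actual code: the names `DirEntry`, `DirCache` and `Refresh` are hypothetical, and it assumes that a directory's last-write time changes whenever an entry directly inside it is added, removed or renamed. Every directory still gets one timestamp check, but only changed directories are listed again.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical cache record: one per directory.
public sealed class DirEntry
{
    public DateTime LastWriteUtc;
    public List<string> Files = new List<string>();
    public List<string> SubDirs = new List<string>();
}

public static class DirCache
{
    // Walks the tree under root, re-listing only directories whose
    // last-write time differs from the cached one.
    // Returns the number of directories that had to be listed again.
    public static int Refresh(string root, string pattern,
                              Dictionary<string, DirEntry> cache)
    {
        int rescanned = 0;
        var seen = new HashSet<string>();
        var pending = new Stack<string>();
        pending.Push(root);

        while (pending.Count > 0)
        {
            string dir = pending.Pop();
            seen.Add(dir);
            DateTime stamp = Directory.GetLastWriteTimeUtc(dir);

            DirEntry entry;
            if (!cache.TryGetValue(dir, out entry) || entry.LastWriteUtc != stamp)
            {
                // New or changed directory: list its direct children again.
                entry = new DirEntry { LastWriteUtc = stamp };
                entry.Files.AddRange(Directory.GetFiles(dir, pattern));
                entry.SubDirs.AddRange(Directory.GetDirectories(dir));
                cache[dir] = entry;
                rescanned++;
            }

            // Subdirectories must still be visited, because a change deeper
            // in the tree does not touch this directory's timestamp.
            foreach (string sub in entry.SubDirs)
                pending.Push(sub);
        }

        // Forget cached directories that are no longer reachable from root.
        var stale = new List<string>();
        foreach (string key in cache.Keys)
            if (!seen.Contains(key)) stale.Add(key);
        foreach (string key in stale)
            cache.Remove(key);

        return rescanned;
    }
}
```

The cache dictionary would be saved to disk on exit and loaded on startup before calling `Refresh`; that part is left out here. Note that this does nothing for the very first run, when the cache is empty and every directory has to be listed.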

What techniques/APIs can I use to speed up file searching?

Is there a way to more directly look at the raw data that make up the file-tables?

Changing your filesystem is the easy answer. FAT32 is 2x-8x faster than NTFS.

From my generic WD 250gb HD with a 32gb partition:

64k Clusters NTFS Defrag=0 08m 22.31 .jpeg
64k Clusters NTFS Defrag=1 08m 11.82 .jpeg
Default Clusters NTFS Defrag=0 19m 23.75 .jpeg
Default Clusters NTFS Defrag=1 05m 44.15 .jpeg
FAT32 Defrag=0 02m 48.25 .jpeg
FAT32 Defrag=1 02m 42.78 .jpeg

If you have more data than FAT32 can hold, use NTFS with a larger cluster size.
64k clusters have a much better worst case, but a worse best case.

The above results were with 100,000 files of 275 kb each.
Each of the results above was for 10 runs, with the filecache cleared between runs.

Last, turning off 8.3 filenames helps a bit.
Thanks for your comments and the effort you put into this! It's very interesting that FAT32 would be faster.

This was paid research for a client. The results were the JPEGs on my hard drive.
The filenames are informative however.

NTFS has unicode file names (2x the size) and a bunch of metadata. Something
on the order of 1K (yes K) per file.. It also does security checks. All the
above times were on xp64 as admin with ownership of the files.

FAT32 is old, and was designed for old slow computers with small HD's.
NTFS was designed for servers, security, and file safety( journals, etc)

Unfortunately I can't change the file system (unless it can be done per folder). If possible I would like the program to start up more quickly than 2 minutes. Note that I don't need to read each file, I just need their names.

File system is per partition. So make a partition for your data..
Or buy another drive. They are cheap.

I can't do that. I don't have access to the users machines.


The above times were for simply scanning the dir's and building
a list of all the files.

I think I may have to run this check in a separate thread, and have the files appear slowly. However I would rather have all files available as quickly as possible. Is there an async mode like you can use when loading files?

A background thread might help. User could do something with the
partial list of files.
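A rough sketch of that background-thread idea, for what it's worth. The names `BackgroundScan` and `Start` are made up for illustration, and it assumes .NET 4.5 or later (for `Task.Run`, `Directory.EnumerateFiles` and `BlockingCollection<T>`). A single worker walks the tree and hands file names over as it finds them, so the rest of the app can start working on a partial list.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class BackgroundScan
{
    // Starts the scan on a worker thread and returns immediately.
    // Matching paths are pushed into `found` as they are discovered;
    // CompleteAdding() tells the consumer that the scan has finished.
    public static Task Start(string root, string pattern,
                             BlockingCollection<string> found)
    {
        return Task.Run(() =>
        {
            try
            {
                var pending = new Stack<string>();
                pending.Push(root);

                while (pending.Count > 0)
                {
                    string dir = pending.Pop();
                    try
                    {
                        // EnumerateFiles yields names lazily instead of
                        // building the whole array first like GetFiles.
                        foreach (string file in Directory.EnumerateFiles(dir, pattern))
                            found.Add(file);
                        foreach (string sub in Directory.EnumerateDirectories(dir))
                            pending.Push(sub);
                    }
                    catch (UnauthorizedAccessException) { /* skip unreadable directory */ }
                    catch (DirectoryNotFoundException) { /* directory vanished mid-scan */ }
                }
            }
            finally
            {
                found.CompleteAdding();
            }
        });
    }
}
```

The consumer would loop over `found.GetConsumingEnumerable()`, which blocks until the next name arrives and ends when the scan completes. In a UI app, results would need to be marshalled back to the UI thread. Because there is only one worker, only one directory request is outstanding at a time.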

With async you would have to be careful not to overwhelm the disks.
If the average disk queue length goes over 2 it does terrible things
to SQL server performance.. On the other hand, if you have server
drives (SCSI/SAS or even SATA with NCQ) you might get a doubling
of performance due to less disk head movement (elevator seeks)
if you are doing random IO.

Defragging does help.


Also, were all the files in the same folder or nested under that folder?

Nested, 120 files per folder, 60 of those folders for each parent

Is it possible to defrag a particular folder?

No.

Have you tried it with windows indexing?

This is for the contents of files, so I do not think this would help.

I didn't know that. Thanks.

BTW, I did find this:

http://community.prestwood.com/ASPSuite/KB/document_view.asp?qid=100273

Indexing seems to only work if you put in the magic "#filename=*.doc", and a # search is a hell of a lot faster than an @ search. I think indexing improved the speed. I've yet to try this search in code. Note: it seems simply marking a drive as indexed is not enough to enable index searching.



good luck.

