Re: searching for text in a mixed file (text and hex)

Tech-Archive recommends: Speed Up your PC by fixing your registry



On Mon, 31 Aug 2009 08:34:10 -0400, "Eddards" <eddards@xxxxxxxxxxx> wrote:

I have the following code below to take a selected file and read it's
contents.
The app has a ListBox (IDC_FILELIST) and Static text (IDC_FILESELECTED).
More will be added later to process the read file.
How do I search for a text string when the file does contain 0x00's in it.
It is a mixed file containing text and hex data?
I compile with no errors and the filename does appear at IDC_FILESELECTED.

Please feel free to direct me in other things I am doing improperly.

BOOL CMyAppDlg::OnInitDialog()
{
CDialog::OnInitDialog();

SetIcon(m_hIcon, TRUE); // Set big icon
SetIcon(m_hIcon, FALSE); // Set small icon

CListBox* List = (CListBox*)GetDlgItem(IDC_FILELIST);
****
This is silly. Create a member variable and let MFC bind the control to the variable.
Then you can just call the methods without requiring the ->; e.g., c_List.ResetContent. (I
start all my control variable names with c_)
****
List->ResetContent();
HANDLE hand;
WIN32_FIND_DATA tell;
CString William = "C:\\temp\\*.ppf"; //file I am listing have a .ppf
extension
*****
That would be _T("C:\\temp\\*.ppf")

Note there is no reason to assume there is a temp directory on C:\\ or that it is
accessible, except on your machine. You should probably use the value of the environment
variable TEMP as the location of temporary files. The reason is that on multiuser
machines, each user might have their own temp directory (otherwise there are information
leaks)
*****

LPCTSTR Ppath = William;
****
Why an LPCTSTR? Why not a CString? There is no reason to use an LPCTSTR here. In fact,
it is not at all clear why you are bothering with a variable at all. You could have
written
hand = FindFirstFile(William, &tell);
and it would work just fine.
****
CString insert;

hand=FindFirstFile (Ppath, &tell);
if (hand != INVALID_HANDLE_VALUE){
insert = tell.cFileName;
List->AddString(insert);
}
Search:
if (FindNextFile(hand, &tell)){
insert = tell.cFileName;
List->AddString(insert);
goto Search;
****
WHAT THE #$%! is this doing? Consider the 'goto' to be a very, very, very, DEAD piece of
syntax, used only in unbelievably rare and exotic situations, or by sloppy programmers. It
isn't needed, and you don't want it. The correct code would be
if(hand != INVALID_HANDLE_VALUE)
{
do
{
List.AddString(tell.cFileName);
} while(FindNextFile(hand, &tell));
FindCloseHandle(hand);
}

Note that we do not need to introduce an unnecessary variable, a label, a goto, a
secondary loop, and we don't even try to execute the loop if the FindFirstFile failed; in
addition, the handle is properly closed.

Generally, if you are about to write a 'goto', just say 'this is a mistake'. You will be
right so often that the few times you are wrong will become obvious. In this case, it is
a glaring mistake to have used such an obsolete construct.
*****
}
UpdateData(false);
****
And what, precisely, do you think "UpdateData" is going to accomplish here? As far as I
can tell, absolutely nothing. It is a waste of time and conceptual energy to use it. Note
that the

return TRUE;
}

void CMyAppDlg::OnSelchangeFilelist()
{
char cSelect[500];
*****
And why are you certain that the contents of the file are 8-bit characters? Why do you
think 500 matters in any way?
****
int nSelect;
long lth;
FILE *in;

nSelect=SendDlgItemMessage(IDC_FILELIST, LB_GETCURSEL, 0, 0L);
****
This is some of the worst code I have seen in a long time for getting an element from a
list. The correct code is
nSelect = c_List.GetCurSel();

Note how many fewer characters it uses; it doesn't use something obscure, hard to read,
and which requires a complex lookup of the LB_GETCURSEL message to figure out how to use.
*****
DlgDirSelect((LPSTR) cSelect, IDC_FILELIST);
****
Why are you doing an LPSTR cast? This makes no sense. You have no reason to believe that
this is an ANSI application! The correct code would have been
TCHAR cSelect[500];
DlgDirSelect(cSelect, IDC_FILELIST);
assuming that you wanted to DlgDirSelect at all. But the correct code is
CString Select;
c_List.GetText(nSelect, Select);
which is far easier, does not presume the character type, does not require a known length
(if this is a filename, MAX_PATH would have sufficed), does not require a static
allocation, and in general is the preferred method. You appear to be trying to write
Petzold-style C code in MFC, which is not using MFC correctly at all.
****

GetDlgItem(IDC_FILESELECTED)->SetWindowText(cSelect);
****
c_FileSelected.SetWindowText(cSelect);
****
UpdateWindow();
*****
Why are you updating the current window. Don't you mean to update the window into which
you just stored the information? The correct code would be
c_FileSelected.UpdateWindow();
****
Fname = "C:\\temp\\"; //set directory path
*****
Again, you cannot presume there is a temp directory in C, or that the user has access to
it. There is in general no reason to assume anything about the file path. The correct
approach would be to provide a "File path" edit control, with a "browse" button next to
it.

File Path [______________________] [...]

In the OnBnClickedBrowse handler you would bring up a "find folder" dialog to select the
folder.

If you want a default path, use #define to define a string constant which is the path,
e.g.,
#define DEFAULT_PATH _T("C:\\TEMP\\")
#define DEFAULT_PATTERN _T("*.ppf");

you can then write
CString path(DEFAULT_PATH DEFAULT_PATTERN);
and have a variable with the right contents.

Then you can write, here
Fname = DEFAULT_PATH + cSelect;
and you will get the correct path, the same one you used before, without having to retype
it (and potentially get confused if you change it!)
****
Fname += cSelect; //append filename selected
in = fopen (Fname, "rb"); //open the selected file for binary
reading
****
How do you know it succeeded? I see no test for success here?
****
fseek(in, 0, SEEK_END); //Move File Pointer to EOF
lth = ftell(in); //Get position of FP
fseek(in, 0, SEEK_SET); //Move FP back to beginning of file
edi = new TCHAR[lth]; //create place for data (declared
"LPTSTR edi;" in header file
*****
Why is it you are using TCHAR to hold something that is clearly a set of bytes. Note that
the result of this is that you are creating a buffer twice as large as the file, because
the file size is in bytes.

It would be so much simpler to write it using CFile:

CFile f;
if(!f.Open(path, CFile::fileRead))
return; // or otherwise deal with error and do not continue beyond here

ULONGLONG len = f.GetLength();
CByteArray buffer;
buffer.SetSize(len + sizeof(TCHAR));
if(!f.Read(buffer.GetData(), (UINT)len))
{ /* failed */
f.Close();
return; // or otherwise stop doing things here
} * failed */
buffer[len] = 0;
if(sizeof(TCHAR) > 1)
buffer[len + 1] = 0;
f.Close();

Note that you don't need to "save the filelength" because it is already in the buffer
array; there is no need to free the buffer because that happens automatically when we
leave the scope of this function, etc.

If you want to pass the buffer in, you can pass it in by reference. There is no need to
create additional variables for all kinds of purposes when you need only one.
****
fread (edi, lth, 1, in); //read entire file into edi
fclose (in);
fileSize = (int)lth; //save the filelength for future use
*****
Now you have a few choices. Firstly, do you have an ABSOLUTE GUARANTEE that the file
contents are NEVER going to be Unicode? This is important.

How long are the files? You will do something different if the files are tens of bytes or
tens of megabytes.

Is the search string short?

Here's a very simple algorithm:

int i = 0;
LPSTR p = (LPSTR)buffer.GetData(); // ASSUMES 8-BIT CHARACTERS
CStringA comparand = "Pattern"; // or whatever your pattern is!
while(true)
{ /* scan file */
LPSTR found = strstr(p, comparand);
if(found)
{ /* found it */
....do something
p = found + comparand.GetLength();
} /* found it */
else
p = found + strlen(found) + 1;

if(p >= &buffer[buffer.GetSize()]
break; // all done
} /* scan file */

This, as written, will find every occurence of the string so if the data is
<random hex data>PatternPatternPattern<random hex data>Pattern<random hex data><EOF>

it will find the pattern four times.

If you have tens of megabytes of data, implementing an optimized algorithm will help. For
example, I might use 'strchr', which is faster than 'strstr', looking for the first
character of the pattern. Only when I've identified the first character would I apply
strstr to that point. Other optimizations, including doing something like the Boyer-Moore
string search, might be possible.
joe
****

}

Joseph M. Newcomer [MVP]
email: newcomer@xxxxxxxxxxxx
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
.


Quantcast