Get path of valid files with regular expression

Asked 5 years ago

Context:

I am monitoring file creation and deletion in a folder with the FileSystemWatcher class, in order to keep a log of the actions performed in it.

Problem:

This folder contains several files and subfolders that I am not interested in monitoring, so before writing to the log I need to check whether the action was performed on a file that is relevant enough to be recorded.

Requirements for the expression:

The file must be at the second level.
Ex.: Folder1\Folder2\File.anything
The file must belong to the folder Rem (be a direct child of it).
Ex.: Folder1\Rem\File.anything

Examples of valid and invalid paths:

"Cliente1\\Rem\\COB0111111.REM.txt"          // Valid name
"Cliente1\\Rem\\23123123.REM.txt"            // Valid name
"Cliente1\\OK\\COB02222222.REM.txt"          // Invalid
"Cliente1\\Ret\\COB0613062019.REM.txt"       // Invalid
"COB0613062019.REM.txt"                      // Invalid
"Cliente2\\COB0613062019.REM.txt"            // Invalid
"Cliente2\\Rem\\COB0633333.REM.txt"          // Valid name
"Cliente2\\Rem\\pasta2\\COB02123123.REM.txt" // Invalid
"Cliente2\\Bla\\Rem\\COB0613062019.REM.txt"  // Invalid
"Cliente1\\COB0613062019.REM.txt"            // Invalid
"Rem\\COB0613062019.REM.txt"                 // Invalid
"Rem"                                        // Invalid
"Cliente3"                                   // Invalid

I performed this check with Split and conditionals, but I would like to perform it with a regular expression.

MCVE (Example of solution without regular expression):

static void Main(string[] args)
{
    List<String> nomes = new List<string>();
    List<String> nomesValidos = new List<string>();

    nomes.Add("Cliente1\\Rem\\COB0111111.REM.txt"); // Valid name
    nomes.Add("Cliente1\\Rem\\23123123.REM.txt"); // Valid name
    nomes.Add("Cliente1\\OK\\COB02222222.REM.txt"); // Invalid
    nomes.Add("Cliente1\\Ret\\COB0613062019.REM.txt"); // Invalid
    nomes.Add("COB0613062019.REM.txt"); // Invalid
    nomes.Add("Cliente2\\COB0613062019.REM.txt"); // Invalid
    nomes.Add("Cliente2\\Rem\\COB0633333.REM.txt"); // Valid name
    nomes.Add("Cliente2\\Rem\\pasta2\\COB02123123.REM.txt"); // Invalid
    nomes.Add("Cliente2\\Bla\\Rem\\COB0613062019.REM.txt"); // Invalid
    nomes.Add("Cliente1\\COB0613062019.REM.txt"); // Invalid
    nomes.Add("Rem\\COB0613062019.REM.txt"); // Invalid
    nomes.Add("Rem"); // Invalid
    nomes.Add("Cliente3"); // Invalid

    foreach (string nome in nomes) {
        var teste = nome.Split("\\");
        if (teste.Length == 3) { // Ensures the file is at the second level
            if (teste[1].ToUpper() == "REM") { // Ensures the file's direct parent is REM
                nomesValidos.Add(nome);
            }
        }
    }
    foreach (string nome in nomesValidos) {
        Console.WriteLine(nome);
    }
    Console.ReadLine();
}

I tried to make some expressions on Regex101, but I didn’t get very close to what I’d like.

The MCVE is better than a version with Regex would be.

– Bacco

2019/06/27 at 21:59
@Bacco except for calling .Add() for each element to build the initial list, the ToUpper(), and the way the backslash is written in the string.

– Maniero

2019/06/27 at 22:02
What is the question?

– Maniero

2019/06/27 at 22:06
have you considered using the native filter? (2nd parameter) https://docs.microsoft.com/en-us/dotnet/api/system.io.filesystemwatcher.-ctor?view=netframework-4.8#System_IO_FileSystemWatcher__ctor_System_String_System_String_

– Bacco

2019/06/27 at 22:39
@Maniero I wrote the example strings exactly as FileSystemWatcher returns them in the EventArgs. My question was how to build a Regex that does the same as this MCVE. @Bacco I did not know that the filter parameter accepted a pattern; I will read the link you sent.

– Caique Romero

2019/06/28 at 13:08
I think you did it well and are now looking to make it worse. Regex is always worse in every aspect you can analyze.

– Maniero

2019/06/28 at 13:17
@Maniero do you mean in terms of performance?

– Caique Romero

2019/06/28 at 13:31
I meant in every sense. I did not answer because the question clearly asks for something worse, but I would at least make this code simpler and more correct: https://dotnetfiddle.net/qLM6Mg.

– Maniero

2019/06/28 at 13:43
I was not aware that Regex is always worse and should be used only as a last resort; otherwise I would have kept the initial solution.

– Caique Romero

2019/06/28 at 14:07
Especially now that I’ve given you a version that doesn’t make unnecessary allocations, which is what was slowing yours down.

– Maniero

2019/06/28 at 14:08

1 answer

by hkotsubo • **55,826** points · Answer 1 · 2019-06-28T01:27:22+00:00

Your solution with Split is already pretty simple, and I don’t know if it really needs a regex, but anyway, one solution would be:

Regex regex = new Regex(@"^[^\\]+\\Rem\\[^\\]+$");
if (regex.IsMatch(@"Cliente1\Rem\COB0111111.REM.txt"))
{
    Console.WriteLine("Válido");    
}
else
{
    Console.WriteLine("Inválido");  
}

In this case, I’m assuming that file and directory names are "anything but \". For that I use the negated character class [^\\], which matches any character that is not listed between [^ and ]. Here the \ must be written as \\, because in [^\] the sequence \] would be interpreted as "the literal character ]", leaving the class unclosed.

And the quantifier + means "one or more occurrences", thus ensuring that it must have at least one character.

The regex also uses the anchors ^ and $, which match the beginning and the end of the string, respectively. This guarantees that the string can contain only what the regex describes.
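To illustrate why the anchors matter, here is a minimal sketch that compares the pattern above with an otherwise identical pattern that omits ^ and $. The class and method names (AnchorDemo, IsValidAnchored, IsValidUnanchored) are made up for this illustration; the sample path is one the question lists as invalid.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    // Same pattern as in the answer, with and without the ^ and $ anchors.
    private static readonly Regex Anchored = new Regex(@"^[^\\]+\\Rem\\[^\\]+$");
    private static readonly Regex Unanchored = new Regex(@"[^\\]+\\Rem\\[^\\]+");

    public static bool IsValidAnchored(string path) => Anchored.IsMatch(path);
    public static bool IsValidUnanchored(string path) => Unanchored.IsMatch(path);

    public static void Main()
    {
        // Rem is at the third level here, so the question treats this path as invalid.
        string deeper = @"Cliente2\Bla\Rem\COB0613062019.REM.txt";

        Console.WriteLine(IsValidAnchored(deeper));   // False: the whole string must fit the pattern
        Console.WriteLine(IsValidUnanchored(deeper)); // True: the substring starting at "Bla" matches
    }
}
```

Without the anchors, IsMatch only needs to find the pattern somewhere inside the string, which is why the deeper path slips through.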

The IsMatch method returns a boolean indicating whether the string matches the regex.

I saw that you use ToUpper() in your code, so if you want the regex to be case insensitive, just use the corresponding RegexOptions value:

Regex regex = new Regex(@"^[^\\]+\\Rem\\[^\\]+$", RegexOptions.IgnoreCase);

Thus, the directory name can be rem, Rem, REM, or any other combination of upper and lower case letters, and the regex will still match.

If you want to get the file name (without the folder names), just put the corresponding chunk in parentheses:

Regex regex = new Regex(@"^[^\\]+\\Rem\\([^\\]+)$", RegexOptions.IgnoreCase);
Match match = regex.Match(@"Cliente1\REM\COB0111111.REM.txt");
if (match.Success)
{
    Console.WriteLine(match.Groups[1].Value);
}

Note that parentheses have been added around the last part: ([^\\]+). They form a capture group, which makes it possible to retrieve the corresponding part of the string.

Since it is the first capture group (because it is the first pair of parentheses that appears in the regex), we can retrieve it through match.Groups[1], where match is the object returned by the Match method.

The code above prints "COB0111111.REM.txt". Adapted to your example (in which there is a list of names), it could look like this:

List<string> nomes = ....
List<string> nomesValidos = new List<string>();
Regex regex = new Regex(@"^[^\\]+\\Rem\\([^\\]+)$", RegexOptions.IgnoreCase);
foreach (string nome in nomes)
{
    Match match = regex.Match(nome);
    if (match.Success)
    {
        nomesValidos.Add(match.Groups[1].Value);
        // or Add(nome), if you want the full path of the file
    }
}

(Example on Ideone.com)

If the input is controlled and you know you will only receive file names, I believe this is enough. The problem is that [^\\] means "any character other than \", so it accepts anything else (including special characters, line breaks, etc.) and may match strings that are not necessarily file or directory names. But if you know these cases don’t occur, it’s fine to use it.

If you want, you can be more restrictive. For example, instead of [^\\]+ you could use [a-z0-9\.\-]+ (one or more letters, digits, dots, or hyphens), so the regex would only accept names made of those characters. But this regex has its own problems: it is quite naive, since it accepts strings like ----- or ..... as names.
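As a rough sketch of that trade-off, the snippet below runs the permissive pattern and a restrictive variant over three sample strings. The class and method names (StrictnessDemo, MatchesPermissive, MatchesRestrictive) and the sample inputs are invented for this illustration; the restrictive pattern assumes RegexOptions.IgnoreCase so that [a-z] also covers uppercase letters.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StrictnessDemo
{
    // Permissive: any character except the backslash is allowed in a name.
    private static readonly Regex Permissive =
        new Regex(@"^[^\\]+\\Rem\\[^\\]+$", RegexOptions.IgnoreCase);

    // Restrictive: only letters, digits, dots and hyphens are allowed in a name.
    private static readonly Regex Restrictive =
        new Regex(@"^[a-z0-9\.\-]+\\Rem\\[a-z0-9\.\-]+$", RegexOptions.IgnoreCase);

    public static bool MatchesPermissive(string s) => Permissive.IsMatch(s);
    public static bool MatchesRestrictive(string s) => Restrictive.IsMatch(s);

    public static void Main()
    {
        string[] samples =
        {
            @"Cliente1\Rem\COB0111111.REM.txt", // ordinary path:       True True
            "Cliente1\\Rem\\line\nbreak.txt",   // contains a newline:  True False
            @"-----\Rem\.....",                 // hyphens and dots:    True True
        };

        foreach (string s in samples)
        {
            Console.WriteLine($"{MatchesPermissive(s)} {MatchesRestrictive(s)}");
        }
    }
}
```

The restrictive class rejects the string with the line break, but it still lets the names made only of hyphens or dots through.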

But I believe that, for the cases presented, it is not worth complicating the regex this much, and the options above should be enough.

Finally, I want to reinforce that your Split solution seems to me to be the simplest. After all, split and match are two sides of the same coin: with split I say what I don’t want in the final result (the separator \), and with match I say what I want (the parts between the \). It is often easier to define one of the two (in your specific case, I find the split easier).
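For comparison, a Split-based check that mirrors the regex could look roughly like the sketch below. SplitDemo and IsValid are names invented here, and the non-empty checks on the first and last parts are an assumption added to match the "one or more characters" meaning of + in the regex; the original MCVE does not have them.

```csharp
using System;

public static class SplitDemo
{
    // Returns true when the path has exactly the form <folder>\Rem\<file>.
    public static bool IsValid(string path)
    {
        string[] parts = path.Split('\\');
        return parts.Length == 3                // exactly two separators: second level
            && parts[0].Length > 0              // assumption: non-empty folder name, like [^\\]+
            && parts[2].Length > 0              // assumption: non-empty file name, like [^\\]+
            && string.Equals(parts[1], "Rem", StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(IsValid(@"Cliente1\Rem\COB0111111.REM.txt"));        // True
        Console.WriteLine(IsValid(@"Cliente2\Bla\Rem\COB0613062019.REM.txt")); // False
        Console.WriteLine(IsValid(@"Rem\COB0613062019.REM.txt"));              // False
    }
}
```

Comparing with StringComparison.OrdinalIgnoreCase plays the role of RegexOptions.IgnoreCase without creating an uppercased copy of the folder name.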