Hello to everyone!
In my Active Directory I have about 100k users.
Each user has a “Description” field that looks similar to the example below:
"The Paramount Company FullName \ Paramount Company \ The Department One FullName \ Dep1 \ Sales Department of the Department One \ Sales Department \ South Division of the Sales Department \ South Division"
and so forth.
This long (over 250 characters) string has to be parsed and shortened to just 32 characters, then stored against each user’s unique ID in a CSV file.
The simplest idea would be to take just the last value, like $Description.Split('\')[-1].Trim() (PowerShell notation):
UserID, Dep32
ivan_ivanov, SouthDivision
But that will not guarantee uniqueness, obviously.
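To see how often that actually happens, a quick check along these lines might help. It is only a sketch, written in Python rather than PowerShell; the sample descriptions and the helper names last_segment and find_collisions are made up for illustration.

```python
from collections import defaultdict

def last_segment(description: str) -> str:
    """Return the final backslash-separated part, with spaces removed."""
    return description.split("\\")[-1].strip().replace(" ", "")

# Hypothetical sample data: two different departments that end in the same name.
descriptions = {
    "ivan_ivanov": "Paramount Company \\ Dep1 \\ Sales Department \\ South Division",
    "petr_petrov": "Paramount Company \\ Dep2 \\ Support Department \\ South Division",
}

def find_collisions(descs: dict) -> dict:
    """Map each short name to the set of distinct full descriptions that produce it,
    keeping only the short names shared by more than one description."""
    seen = defaultdict(set)
    for full in descs.values():
        seen[last_segment(full)].add(full)
    return {short: fulls for short, fulls in seen.items() if len(fulls) > 1}

collisions = find_collisions(descriptions)
for short, fulls in collisions.items():
    print(short, "is shared by", len(fulls), "different descriptions")
```

Running something like this over the real export would show how many departments the last-value approach cannot tell apart.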
Could someone recommend an approach that would at least raise the probability of the resulting string (Dep32) being unique?
I guess I’m misunderstanding. Is “ivan_ivanov” already a unique identifier? Or does the system allow more than one “ivan_ivanov”? Does “South Division” have to be unique too, or just in combination with “ivan_ivanov”? And does this unique ID have to be human readable, or is it just for the DB?
While shortening LongDeptDescription we still need to keep the department name unique, so that our users do not end up assigned to the wrong department.
I don’t see how that would be foolproof. Too many chances for them to rename a subdivision and suddenly you have a collision.
Does this need to be unique to be a DB index? Or just because you want to use it?
If it’s needed for an index, you may want to check whether they already have codes for the different departments. Every large corp I’ve been a part of had specific codes for all of these. You could use those in your index and just parse them out for display.
But if you really must have some human readable index (again, I think those functions should be split) then you could have a table of different departments/subdepts/etc and what their abbreviations are. From there you could make sure that the abbreviations are unique. Of course the usernames would have to be unique too.
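A rough sketch of that abbreviation table idea, in Python. The table contents, the "-" separator and the function names are all assumptions for illustration; the real table would have to come from whoever owns the department list.

```python
from collections import Counter

# Hypothetical lookup table: full unit name -> agreed abbreviation.
ABBREVIATIONS = {
    "The Department One FullName": "Dep1",
    "Sales Department of the Department One": "Sales",
    "South Division of the Sales Department": "South",
}

def duplicate_abbreviations(table: dict) -> list:
    """Return the abbreviations that are assigned to more than one unit."""
    counts = Counter(table.values())
    return sorted(abbr for abbr, n in counts.items() if n > 1)

def short_path(full_names: list, table: dict, limit: int = 32) -> str:
    """Join the abbreviations of each level; refuse anything over the limit."""
    code = "-".join(table[name] for name in full_names)
    if len(code) > limit:
        raise ValueError(f"{code!r} is longer than {limit} characters")
    return code

print(duplicate_abbreviations(ABBREVIATIONS))
print(short_path(list(ABBREVIATIONS), ABBREVIATIONS))
```

Note that unique abbreviations only make the combined path unique if the same unit never appears under two different parents with the same chain of abbreviations, so the duplicate check is a minimum, not a guarantee.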
It’s not possible to guarantee uniqueness that way though; there has to be some form of unique ID, and that ID has to be a certain length to reduce the chance of collisions. You can define some part of the (string) description as always needing to be unique, but given the number of users, the likelihood of race conditions leading to duplicated keys is relatively high. It means some extra logic, but having a unique randomised string of 32 chars, or attaching a shorter randomised string to some key piece of the description, is your best bet imo.
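Something like the following could illustrate the "short string attached to a key piece of the description" variant. One deliberate substitution: instead of a random string it uses the first 8 hex characters of a SHA-256 hash of the full description, so the same description always gives the same code without storing anything extra. The function name dep32 and the 8-character suffix length are assumptions.

```python
import hashlib

def dep32(description: str, digest_len: int = 8, limit: int = 32) -> str:
    """Build '<last segment>-<hash prefix>', at most `limit` characters long."""
    # Collapse repeated whitespace so cosmetic differences do not change the hash.
    normalized = " ".join(description.split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:digest_len]
    # Readable part: the last backslash-separated segment without spaces,
    # cut so that label + "-" + digest fits into the limit.
    label = normalized.split("\\")[-1].replace(" ", "")
    label = label[: limit - digest_len - 1]
    return f"{label}-{digest}"

a = dep32("Paramount Company \\ Dep1 \\ Sales Department \\ South Division")
b = dep32("Paramount Company \\ Dep2 \\ Support Department \\ South Division")
print(a)
print(b)
```

With an 8-character hex suffix (32 bits) a collision between two departments that share a last segment is unlikely but not impossible, so a duplicate check over the finished CSV is still worth doing. A renamed department also gets a new code, which may or may not be what you want.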
An approach may be to use some sort of reference system.
This example uses a base 36 numbering system
e.g. DEP[A-Z,0-9][A-Z,0-9]
DEPAA = department 1
DEPAB = department 2
…
DEP99 = department 1296
dep99 divAB usAAA … = department 1296/1296, division 2/1296, countrycode city 1/46656.
Based on the provided description, I would probably reserve the last 4 characters for a unique number, allowing a further 1,679,616 (36^4) unique identifiers.
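For reference, generating such codes is only a few lines. This is a sketch in Python that assumes the 36-symbol alphabet from the pattern above (A-Z followed by 0-9) and 1-based department numbers; the function names are made up.

```python
import string

# 36 symbols in the order implied by the pattern: A-Z first, then 0-9.
ALPHABET = string.ascii_uppercase + string.digits

def encode(number: int, width: int) -> str:
    """Encode a non-negative integer as a fixed-width string over ALPHABET."""
    base = len(ALPHABET)
    if not 0 <= number < base ** width:
        raise ValueError(f"{number} does not fit into {width} characters")
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

def department_code(n: int) -> str:
    """Department 1 -> DEPAA, department 2 -> DEPAB, and so on."""
    return "DEP" + encode(n - 1, 2)

print(department_code(1), department_code(2), department_code(36 ** 2))
```

Two characters cover 36^2 = 1296 values, three cover 46656, and four cover 1,679,616.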
You could do better than base 36 if you made it case sensitive (base 62, in fact). Four digits and you’ve got yourself almost 15 million combinations for some decent scalability. And if you’re obsessed with making it human-readable, just append some abbreviated versions of the other data (dept name abbreviation + truncated name, perhaps) to that.
So you’d have something like “aM8k-SalesSouth-IvanIvanovicIvan” (-ov got truncated). As long as the dept abbreviations were kept short and you were smart about how to truncate the name (e.g. don’t include middle names like above), you should usually end up with something pretty readable.
…But that whole approach is just messy, like mixing ice cream with gravy. Better to use two separate fields and concatenate them together if you need to (e.g. in a URL). Besides, what if Ivan transfers to Operations? Then your whole “ID” would be ruined if you used the ice cream gravy approach.
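To make the two-field suggestion concrete, here is a minimal Python sketch. It assumes a case-sensitive base-62 alphabet (0-9, a-z, A-Z) for the 4-character ID and a hypothetical UserRecord class; the readable string is built only when needed and never stored as the key.

```python
import string
from dataclasses import dataclass

# 62 symbols: digits, lowercase, uppercase (case-sensitive).
BASE62 = string.digits + string.ascii_lowercase + string.ascii_uppercase

def encode62(number: int, width: int = 4) -> str:
    """Encode a non-negative integer as a fixed-width base-62 string."""
    if not 0 <= number < 62 ** width:
        raise ValueError(f"{number} does not fit into {width} characters")
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, 62)
        digits.append(BASE62[remainder])
    return "".join(reversed(digits))

@dataclass
class UserRecord:
    uid: str    # stable, opaque ID: never changes
    dept: str   # department abbreviation: may change on transfer
    name: str   # name without separators

    def display(self, limit: int = 32) -> str:
        """Concatenate the fields for display and cut to the limit."""
        return f"{self.uid}-{self.dept}-{self.name}"[:limit]

ivan = UserRecord(uid="aM8k", dept="SalesSouth", name="IvanIvanovicIvanov")
print(ivan.display())
ivan.dept = "Operations"   # transfer: the ID itself is untouched
print(ivan.uid, ivan.display())
```

Because only uid is used as the key, a transfer or a department rename changes what is displayed without invalidating anything that references the user.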