I’m working on a project that will receive a full address as a string, but I need to find a way to break it down into separate variables regardless of punctuation. For example, if I was doing the Space Needle in Seattle, the address string would be
string = “400 BROAD Street, SEATTLE WA 98109, United States”
but I’m trying to break it down into seperate variables like this
streetNumber = “400 BROAD Street”
city = “SEATTLE”
state = “WA”
zip = “98109”
country = “United States”
Has anyone got an idea of how to do this? It’s been busting my brain for a day now. I’ve considered maybe passing it to an API like Google Maps, but unfortunately they charge.
Really appreciate any tips or advice anyone might have.
If the input data is consistently formatted, i.e. the strings always match a specific pattern, parsing is easy and can be halted on malformed data with a useful error message (JSON is an example of such a format). If it isn’t, then it’s extremely difficult to parse. Your requirement that this work “regardless of punctuation” indicates the latter.
With that, there’s a high possibility you’ve just made your job almost impossible. If there is no consistent structure, you’re trying to parse human language.
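For contrast, here is a minimal sketch of the easy case, assuming the input could be guaranteed to follow the exact layout of the example string (street, then city/state/zip, then country, separated by commas). The name `parse_strict` and the pattern are illustrative assumptions, not something the input is known to satisfy:

```python
import re

# Assumed fixed layout: "<street>, <CITY> <ST> <ZIP>, <country>"
ADDRESS_RE = re.compile(
    r"^(?P<street>[^,]+),\s*"
    r"(?P<city>.+?)\s+"
    r"(?P<state>[A-Z]{2})\s+"
    r"(?P<zip>\d{5}(?:-\d{4})?),\s*"
    r"(?P<country>[^,]+)$"
)


def parse_strict(address: str) -> dict:
    """Parse an address in the assumed layout, or fail loudly."""
    match = ADDRESS_RE.match(address.strip())
    if match is None:
        # Halting on malformed data is the whole point of a strict format.
        raise ValueError(f"address does not match the expected layout: {address!r}")
    return match.groupdict()


print(parse_strict("400 BROAD Street, SEATTLE WA 98109, United States"))
```

Anything that deviates from the layout (a missing comma, a lowercase state code) raises `ValueError` instead of producing a wrong guess.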
The following will be extremely fragile, but assuming just US addresses (this all just gets exponentially harder the more countries you add):
The address probably ends with the country. A very large % of addresses may not, but you can search near the end of the string for “United States” and any possible variations.
It has to be a search at the end of the string because the first line/s of the address (name of building, street, town, area etc) can also include the name of the country, so just searching the whole string will cause false matches.
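A rough sketch of that end-anchored country search might look like this. The list of spellings and the function name `split_trailing_country` are assumptions for illustration, and the list is deliberately incomplete:

```python
import re

# Hypothetical, incomplete list of spellings; extend it for real data.
US_COUNTRY_VARIANTS = [
    "United States of America", "United States", "U.S.A.", "U.S.", "USA", "US",
]

# Longest variants first so "United States of America" is tried before
# "United States". The trailing "$" restricts the match to the end.
_COUNTRY_RE = re.compile(
    r"[\s,]+(?:"
    + "|".join(
        re.escape(v) for v in sorted(US_COUNTRY_VARIANTS, key=len, reverse=True)
    )
    + r")[\s,.]*$",
    re.IGNORECASE,
)


def split_trailing_country(address: str):
    """Return (remaining_text, country_or_None), matching only at the end."""
    match = _COUNTRY_RE.search(address)
    if match is None:
        return address.strip(" ,"), None
    # Normalise every variant to one canonical name.
    return address[: match.start()].strip(" ,"), "United States"


print(split_trailing_country("400 BROAD Street, SEATTLE WA 98109, United States"))
print(split_trailing_country("1 United States Ave, Seattle WA 98109"))
```

The second call shows why the anchor matters: “United States” appears in the street name, but because it is not at the end, no country is reported.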
If that is located, remove it from the string before the next search. Then there are a few options; I think I would try them in this order:
- look near the end for something that matches the format of a US zip code.
- check for a state code against a list of state codes. The state code should be on its own in the address, separated by a space on both sides, or by commas or something similar. This is complicated by the fact that other parts of the address could contain the same sequence of letters in the same format, e.g. as part of the building name, so you need to guess where it should be.
- city name. Without punctuation this is very difficult to find (I assume “city” also covers towns?), so you want to match against a list. But that list could be very long and needs to include variations. If you’ve guessed the state at this point, that narrows things down a lot, so this should happen after that. Again, a city name can appear multiple other times in the address, so you have to guess it will be at the end of the remaining string.
Each time a guess is made, that bit of the string plus its neighbouring delimiters is excised before the next search, so you’re normally working leftwards from the right-hand end.
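The right-to-left peeling described above could be sketched roughly as follows, assuming the country has already been removed. `STATE_CODES` and `CITIES_BY_STATE` are tiny illustrative stand-ins for the real lists, and `_peel` / `parse_right_to_left` are made-up helper names:

```python
import re

# Illustrative subsets only; real lists would be far longer.
STATE_CODES = {"WA", "OR", "CA", "NY", "TX"}
CITIES_BY_STATE = {"WA": ["Seattle", "Spokane", "Walla Walla"]}


def _peel(text: str, pattern: str):
    """Match `pattern` at the right end of `text`.

    Returns (rest, value); value is None and rest is unchanged on no match.
    The matched value and its neighbouring delimiters are cut from `rest`.
    """
    match = re.search(
        r"(?:^|[\s,]+)(" + pattern + r")[\s,]*$", text, re.IGNORECASE
    )
    if match is None:
        return text, None
    return text[: match.start()].rstrip(" ,"), match.group(1)


def parse_right_to_left(text: str) -> dict:
    # 1. Zip code: five digits, optionally followed by -1234.
    rest, zip_code = _peel(text, r"\d{5}(?:-\d{4})?")
    # 2. State code, checked against the list.
    rest, state = _peel(rest, "|".join(sorted(STATE_CODES)))
    state = state.upper() if state else None
    # 3. City: only cities of the guessed state, longest names first.
    city = None
    for candidate in sorted(CITIES_BY_STATE.get(state, []), key=len, reverse=True):
        rest, city = _peel(rest, re.escape(candidate))
        if city:
            break
    # Whatever is left is assumed to be the first line/s.
    return {"first_lines": rest, "city": city, "state": state, "zip": zip_code}


print(parse_right_to_left("400 BROAD Street, SEATTLE WA 98109"))
print(parse_right_to_left("Seattle Tower, 1218 3rd Ave, Seattle WA 98101"))
```

Because every match is anchored to the end of what remains, the “Seattle” in the building name in the second call is left alone. Every step is still a guess: a missing zip or an unlisted city simply comes back as `None`.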
- the first line/s can be literally anything and are very difficult to parse, but you can possibly assume the remaining string is that. (What about addresses with multiple first lines? Company name, number in building, building name, complex, street: that’s 5 lines before you even get to the town/city.)
Anything you can guarantee about the input needs to be added to the above to enable you to narrow down the results.