Re: extracting the root domain from a URL [message #171647 is a reply to message #171644]
Fri, 14 January 2011 02:43
Denis McMahon
On 13/01/11 23:26, Mike wrote:
> I thought .com, .asia, etc. was the "TLD". What would you call the
> 'site.com' or 'site.co.uk' portion of the url? Regardless of the
> name, can you suggest an effective and accurate way to extract it?
>
> In the domain names that use an extra part of the name (e.g.,
> 'site.com.tw' or 'site.co.uk'), I've only ever seen 'com' and 'co'
> used that way. I guess I could check if that center part is 'com' or
> 'co' rather than checking to see if it's strlen() > 3. Though, I'm
> not sure if that's all folks are using. I wonder if there are any
> published conventions for it?
None that I know of.
Each registry that issues domain names is free to make up its own rules,
and of course any domain owner can split their domain further. Here are
some example websites; see if you can derive a common rule:
www.bp.com
www.edf.com
waldorf.cs.man.ac.uk
www.cs.man.ac.uk
www.cs.manchester.ac.uk
waldorf.cs.manchester.ac.uk
www.gammon.com.au
www.police.uk
www.btp.police.uk
www.parliament.uk
www.bt.co.uk
rswww.com
www.merlyn.demon.co.uk
When you've found a foolproof way to parse all of the above, someone
will come along with a site name that breaks your parser.
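To make that concrete, here is a minimal sketch of the 'com'/'co' check
suggested in the question, run against the hosts listed above. It is
written in Python for brevity (the same logic ports directly to PHP), and
the function name naive_root_domain is made up for illustration. It is
only the heuristic under discussion, not a recommended parser.

```python
def naive_root_domain(host):
    """Guess the registered domain using the 'com'/'co' heuristic."""
    labels = host.lower().rstrip(".").split(".")
    # Keep three labels when the second-to-last one is 'com' or 'co',
    # otherwise keep only the last two.
    keep = 3 if len(labels) >= 3 and labels[-2] in ("com", "co") else 2
    return ".".join(labels[-keep:])


hosts = [
    "www.bp.com",
    "www.cs.man.ac.uk",
    "waldorf.cs.manchester.ac.uk",
    "www.gammon.com.au",
    "www.btp.police.uk",
    "www.bt.co.uk",
    "rswww.com",
    "www.merlyn.demon.co.uk",
]

for host in hosts:
    print(f"{host:30} -> {naive_root_domain(host)}")
```

On this list the heuristic gets www.bp.com, www.gammon.com.au and
www.bt.co.uk right, but it returns 'ac.uk' for both Manchester hosts and
'police.uk' for www.btp.police.uk, and it has no way of knowing whether
merlyn.demon.co.uk is a separate site under demon.co.uk. Adding 'ac' to
the check fixes one case and leaves the next one waiting.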
Rgds
Denis McMahon