extracting the root domain from a URL [message #171642] |
Thu, 13 January 2011 21:50 |
Mike
Given any valid URL, I'd like to extract the root domain like this:
http://www.site.com = site.com
http://xxx.yyy.site.com = site.com
http://subdomain.site.com = site.com
http://www.site.com.tw = site.com.tw
http://xxx.yyy.site.com.asia = site.com.asia
http://subdomain.site.com.af = site.com.af
I've written some code (below), which works on the examples, but falls
apart if the domain name is three characters long (e.g., ibm.com).
Does someone know of a way to do this even with three letter domain
names?
Here's my current code:
function getRootDomain($url)
{
    // Get rid of junk
    if (!isValidUrl($url)) { return false; }
    // parse the url to get the host name
    $parsedUrl = parse_url($url);
    // break it apart by the '.' and flip it around
    $parts = array_reverse(explode('.', $parsedUrl['host']));
    // remove all but the last three parts (e.g., 'www.site.com' or
    // 'site.com.tw' or if there's only two 'site.com')
    while (count($parts) > 3)
    {
        array_pop($parts);
    }
    // if there are three parts, and the middle part is more than 3
    // characters, then ditch the first part
    // example: www.site.com - 'site' > 3 so ditch the 'www', site.com.tw
    // = 'com' isn't > 3, so keep the 'site'
    if (isset($parts[2]) && strlen($parts[1]) > 3) { unset($parts[2]); }
    // pass back the reassembled root domain name
    return implode('.', array_reverse($parts));
}
|
|
|
Re: extracting the root domain from a URL [message #171643 is a reply to message #171642] |
Thu, 13 January 2011 22:29 |
Captain Paralytic
On Jan 13, 9:50 pm, Mike <mpea...@gmail.com> wrote:
> Given any valid URL, I'd like to extract the root domain like this:
>
> http://www.site.com = site.com
> http://xxx.yyy.site.com = site.com
> http://subdomain.site.com = site.com
> http://www.site.com.tw = site.com.tw
> http://xxx.yyy.site.com.asia = site.com.asia
> http://subdomain.site.com.af = site.com.af
>
You can't extract the root domain like that because the "root domain"
is the .com, .asia, .tw, .af part.
|
|
|
Re: extracting the root domain from a URL [message #171644 is a reply to message #171643] |
Thu, 13 January 2011 23:26 |
Mike
I thought .com, .asia, etc. was the "TLD". What would you call the
'site.com' or 'site.co.uk' portion of the url? Regardless of the
name, can you suggest an effective and accurate way to extract it?
In the domain names that use an extra part of the name (e.g.,
'site.com.tw' or 'site.co.uk'), I've only ever seen 'com' and 'co'
used that way. I guess I could check if that center part is 'com' or
'co' rather than checking to see if it's strlen() > 3. Though, I'm
not sure if that's all folks are using. I wonder if there are any
published conventions for it?
Mike
|
|
|
Re: extracting the root domain from a URL [message #171647 is a reply to message #171644] |
Fri, 14 January 2011 02:43 |
Denis McMahon
On 13/01/11 23:26, Mike wrote:
> I thought .com, .asia, etc. was the "TLD". What would you call the
> 'site.com' or 'site.co.uk' portion of the url? Regardless of the
> name, can you suggest an effective and accurate way to extract it?
>
> In the domain names that use an extra part of the name (e.g.,
> 'site.com.tw' or 'site.co.uk'), I've only ever seen 'com' and 'co'
> used that way. I guess I could check if that center part is 'com' or
> 'co' rather than checking to see if it's strlen() > 3. Though, I'm
> not sure if that's all folks are using. I wonder if there are any
> published conventions for it?
None that I know.
Each registry that issues domain names is free to make up its own rules,
and of course any domain owner can split it further. Here are some
example websites, see if you can derive a common rule:
www.bp.com
www.edf.com
waldorf.cs.man.ac.uk
www.cs.man.ac.uk
www.cs.manchester.ac.uk
waldorf.cs.manchester.ac.uk
www.gammon.com.au
www.police.uk
www.btp.police.uk
www.parliament.uk
www.bt.co.uk
rswww.com
www.merlyn.demon.co.uk
When you've found a foolproof way to parse all of the above, someone
will come along with a sitename that breaks your parser.
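A minimal sketch of what happens when the 'com'/'co' check proposed earlier in the thread is run over the hosts listed above. The function name guessRootByComCo is hypothetical, and the rule is only the one floated in this thread, not a recommended method.

```php
<?php
// Hypothetical helper: keep three labels when the second-to-last label is
// 'com' or 'co', otherwise keep two. This is the rule under discussion,
// shown here only to see where it breaks.
function guessRootByComCo(string $host): string
{
    $labels = explode('.', strtolower($host));
    $count = count($labels);
    if ($count <= 2) {
        return implode('.', $labels);
    }
    $keep = in_array($labels[$count - 2], ['com', 'co'], true) ? 3 : 2;
    return implode('.', array_slice($labels, -$keep));
}

// The example hosts from the list above.
$hosts = [
    'www.bp.com', 'www.edf.com', 'waldorf.cs.man.ac.uk', 'www.cs.man.ac.uk',
    'www.cs.manchester.ac.uk', 'waldorf.cs.manchester.ac.uk',
    'www.gammon.com.au', 'www.police.uk', 'www.btp.police.uk',
    'www.parliament.uk', 'www.bt.co.uk', 'rswww.com', 'www.merlyn.demon.co.uk',
];

foreach ($hosts as $host) {
    printf("%-28s => %s\n", $host, guessRootByComCo($host));
}
```

For www.cs.man.ac.uk the rule yields ac.uk, which is a registry-level category rather than anyone's site, so the check already fails on this list.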
Rgds
Denis McMahon
|
|
|
Re: extracting the root domain from a URL [message #171648 is a reply to message #171644] |
Fri, 14 January 2011 07:51 |
Luuk
On 14-01-11 00:26, Mike wrote:
> I thought .com, .asia, etc. was the "TLD". What would you call the
> 'site.com' or 'site.co.uk' portion of the url? Regardless of the
> name, can you suggest an effective and accurate way to extract it?
>
> In the domain names that use an extra part of the name (e.g.,
> 'site.com.tw' or 'site.co.uk'), I've only ever seen 'com' and 'co'
> used that way. I guess I could check if that center part is 'com' or
> 'co' rather than checking to see if it's strlen() > 3. Though, I'm
> not sure if that's all folks are using. I wonder if there are any
> published conventions for it?
>
> Mike
>
All TLD names are listed here:
http://www.icann.org/en/registries/top-level-domains.htm
For site.co.uk, the tld-name listed is .uk
Read http://en.wikipedia.org/wiki/.uk, which explains why it's not .gb and
why co.uk is used...
--
Luuk
|
|
|
Re: extracting the root domain from a URL [message #171652 is a reply to message #171642] |
Fri, 14 January 2011 08:43 |
alvaro.NOSPAMTHANX
On 13/01/2011 22:50, Mike wrote:
> Does someone know of a way to do this even with three letter domain
> names?
[...]
> if( isset($parts[2]) && strlen($parts[1]) > 3) { unset($parts[2]); }
I'd dare say it's 3 because you hard-coded a 3 in your code.
Anyway, as Denis already explained, the number of parts in the domains you
can buy does not follow any mathematical rule. E.g., you can register
"example.es", "example.nom.es" and "example.co.uk" but not
"example.nom.uk" or "example.co.es".
The only way to do it off-line is to compile a full list from all the
registries around the world and keep it updated as the rules change
over time.
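A minimal sketch of the list-based lookup described above. The suffix table here is a tiny illustrative assumption built from names mentioned in this thread, and registrableDomain is a hypothetical helper; a usable table would have to cover every registry and be kept current (the community-maintained Public Suffix List is one such compiled list).

```php
<?php
// Illustrative, deliberately incomplete suffix table (an assumption for this
// sketch, not an authoritative list).
$suffixes = ['com', 'es', 'nom.es', 'co.uk', 'com.tw'];

/**
 * Hypothetical helper: return the known suffix plus one label to its left,
 * or null when no suffix in the table matches.
 */
function registrableDomain(string $host, array $suffixes): ?string
{
    $labels = explode('.', strtolower(rtrim($host, '.')));
    $count = count($labels);
    // Try the longest trailing candidate first so 'nom.es' wins over 'es'.
    for ($i = 1; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            return $labels[$i - 1] . '.' . $candidate;
        }
    }
    return null; // unknown suffix, or the host is itself a bare suffix
}

// Usage: pull the host out of the URL first, then look it up.
$host = parse_url('http://www.ibm.com/', PHP_URL_HOST);
var_dump(registrableDomain($host, $suffixes));
var_dump(registrableDomain('www.example.co.uk', $suffixes));
var_dump(registrableDomain('example.nom.uk', $suffixes));
```

Because the lookup never inspects label lengths, a three-letter name such as ibm.com is handled the same way as any other; the accuracy depends entirely on how complete the table is.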
--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://borrame.com
-- My satin-finish humor website: http://www.demogracia.com
--
|
|
|
|
Re: extracting the root domain from a URL [message #171655 is a reply to message #171644] |
Fri, 14 January 2011 11:17 |
Captain Paralytic
On Jan 13, 11:26 pm, Mike <mpea...@gmail.com> wrote:
> I thought .com, .asia, etc. was the "TLD". What would you call the
> 'site.com' or 'site.co.uk' portion of the url? Regardless of the
> name, can you suggest an effective and accurate way to extract it?
It is and it is also the root. It may not be the organisation root,
but it is the domain root.
You will see a lot of companies that have something.uk.com. This is
because someone registered uk.com and now sells subdomains to other
companies. You have no way of knowing how many domain elements you
need to get to the organisation root.
http://www.answers.com/topic/root-domain
|
|
|
Re: extracting the root domain from a URL [message #171666 is a reply to message #171644] |
Fri, 14 January 2011 23:41 |
Thomas 'PointedEars'
Mike wrote:
> I thought .com, .asia, etc. was the "TLD".
It is, although that includes a trailing dot that is usually omitted.
> What would you call the 'site.com' or 'site.co.uk' portion of the url?
That is the second-level domain (which is only loosely related to URLs; this
is about DNS). BTW, the "domain root", if any, would be the trailing dot of
any domain name, which is usually not written.
> Regardless of the name, can you suggest an effective and accurate way to
> extract it?
Yes.
> In the domain names that use an extra part of the name (e.g.,
> 'site.com.tw' or 'site.co.uk'),
That is the third-level domain or, as people often say using the umbrella
term, the sub-level domain.
> I've only ever seen 'com' and 'co' used that way.
There are several others.
> I guess I could check if that center part is 'com' or
> 'co' rather than checking to see if it's strlen() > 3.
Bad idea.
> Though, I'm not sure if that's all folks are using. I wonder if there are
> any published conventions for it?
Each top-level domain has an assigned authority, usually called a NIC
(Network Information Center), which writes the book on the domain names
under their control. Most if not all of them have a website where you would
find those rules. IANA (Internet Assigned Numbers Authority) maintains a
list of the registered TLDs and their assigned authorities, you can find it
on their website. Really, STFW.
PointedEars
--
Danny Goodman's books are out of date and teach practices that are
positively harmful for cross-browser scripting.
-- Richard Cornford, cljs, <cife6q$253$1$8300dec7(at)news(dot)demon(dot)co(dot)uk> (2004)
|
|
|
Re: extracting the root domain from a URL [message #171667 is a reply to message #171654] |
Fri, 14 January 2011 23:54 |
Thomas 'PointedEars'
Jonathan Stein wrote:
> On 13-01-2011 22:50, Mike wrote:
>> Given any valid URL, I'd like to extract the root domain like this:
>>
>> http://www.site.com = site.com
>> http://xxx.yyy.site.com = site.com
>> http://subdomain.site.com = site.com
>> http://www.site.com.tw = site.com.tw
>> http://xxx.yyy.site.com.asia = site.com.asia
>> http://subdomain.site.com.af = site.com.af
>
> If you need a fast lookup, you'll probably need to maintain a database
> with rules for each TLD you intend to support.
>
> Otherwise you could go for a series of "whois" lookups. If whois
> succeeds for "site.com.af" but fails for "subdomain.site.com.af", then
> "site.com.af" was probably what you were looking for.
WHOIS would be overkill here and it is not universally supported anymore
(for example, DENIC dropped WHOIS support a few years ago except via their
website because of misuse), so you would get false positives.
The proper internet service to use here is DNS itself, of course. You would
make a connection to port 53/udp on a nameserver that does recursive DNS
lookups (unless you want to consider the local host's DNS configuration) and
request information about the `A' (IPv4) or `AAAA' (IPv6) resource record of
the domain-part part (sic!). Repeat adding a sub-level component until the
query is successful or the full domain-part is reached.
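A minimal sketch of the label-by-label DNS walk described above. It swaps in PHP's checkdnsrr(), which uses the local host's resolver configuration, instead of talking to a chosen recursive name server on port 53/udp. The function name shortestResolvingName, the injectable $hasAddress callback, and the choice to start at two labels are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper: return the shortest trailing part of $host for which
 * $hasAddress reports an address record, or null if none does.
 */
function shortestResolvingName(string $host, ?callable $hasAddress = null): ?string
{
    // Default check: A or AAAA record via the local resolver.
    $hasAddress = $hasAddress ?? function (string $name): bool {
        return checkdnsrr($name, 'A') || checkdnsrr($name, 'AAAA');
    };
    $labels = explode('.', strtolower(rtrim($host, '.')));
    $total = count($labels);
    // Start with the last two labels and add one label to the left per round.
    for ($n = min(2, $total); $n <= $total; $n++) {
        $candidate = implode('.', array_slice($labels, -$n));
        if ($hasAddress($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// Offline usage with a stub resolver, so no network access is needed.
$stub = function (string $name): bool {
    return in_array($name, ['example.co.uk', 'www.example.co.uk'], true);
};
var_dump(shortestResolvingName('www.example.co.uk', $stub));
```

Passing the check as a callback keeps the walk itself testable without DNS; the result is only as meaningful as the assumption that the wanted name carries an address record of its own.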
PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)
|
|
|
Re: extracting the root domain from a URL [message #171668 is a reply to message #171655] |
Fri, 14 January 2011 23:59 |
Thomas 'PointedEars'
Captain Paralytic wrote:
> On Jan 13, 11:26 pm, Mike <mpea...@gmail.com> wrote:
>> I thought .com, .asia, etc. was the "TLD". What would you call the
>> 'site.com' or 'site.co.uk' portion of the url? Regardless of the
>> name, can you suggest an effective and accurate way to extract it?
>
> It is and it is also the root. It may not be the organisation root,
> but it is the domain root.
>
> […]
> http://www.answers.com/topic/root-domain
I suggest you re-read that (correct) answer you are referring to as it
proves you wrong.
PointedEars
|
|
|
Re: extracting the root domain from a URL [message #171696 is a reply to message #171667] |
Sun, 16 January 2011 09:29 |
Jonathan Stein
On 15-01-2011 00:54, Thomas 'PointedEars' Lahn wrote:
> WHOIS would be overkill here and it is not universally supported anymore
> (for example, DENIC dropped WHOIS support a few years ago except via their
> website because of misuse), so you would get false positives.
We don't need the actual WHOIS data, and I believe that even DENIC
provides simple WHOIS status information.
> The proper internet service to use here is DNS itself, of course.
Depending on how Mike defines "root domain", I don't think you can do
this reliably from DNS.
If you look up www.example.com, and there is no A or AAAA record for
example.com, you'll get www.example.com as the "root domain".
Another approach would be to ask for NS records and define "root domain"
as the highest level with an NS record. This could however include
department.example.com, which might not be what Mike intended...
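A minimal sketch of the NS-record variant, reading "highest level with an NS record" as the most specific name that has its own NS records. The function name deepestDelegatedName and the injectable $hasNs callback are hypothetical; the default check uses PHP's checkdnsrr() with the local resolver.

```php
<?php
/**
 * Hypothetical helper: walk from the full host name towards the TLD and
 * return the first name that $hasNs reports NS records for, or null.
 */
function deepestDelegatedName(string $host, ?callable $hasNs = null): ?string
{
    // Default check: NS record via the local resolver.
    $hasNs = $hasNs ?? function (string $name): bool {
        return checkdnsrr($name, 'NS');
    };
    $labels = explode('.', strtolower(rtrim($host, '.')));
    $total = count($labels);
    // Drop one leftmost label per round: full name first, TLD last.
    for ($i = 0; $i < $total; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if ($hasNs($candidate)) {
            return $candidate;
        }
    }
    return null;
}

// Offline usage with a stub: a delegated department zone is returned
// instead of example.com, which is the caveat raised above.
$stub = function (string $name): bool {
    return in_array($name, ['com', 'example.com', 'department.example.com'], true);
};
var_dump(deepestDelegatedName('www.department.example.com', $stub));
var_dump(deepestDelegatedName('www.example.com', $stub));
```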
Regards
Jonathan
|
|
|
Re: extracting the root domain from a URL [message #171698 is a reply to message #171668] |
Sun, 16 January 2011 13:11 |
Captain Paralytic
On Jan 14, 11:59 pm, Thomas 'PointedEars' Lahn <PointedE...@web.de>
wrote:
> Captain Paralytic wrote:
>> On Jan 13, 11:26 pm, Mike <mpea...@gmail.com> wrote:
>>> I thought .com, .asia, etc. was the "TLD". What would you call the
>>> 'site.com' or 'site.co.uk' portion of the url? Regardless of the
>>> name, can you suggest an effective and accurate way to extract it?
>
>> It is and it is also the root. It may not be the organisation root,
>> but it is the domain root.
>
>> […]
>> http://www.answers.com/topic/root-domain
>
> I suggest you re-read that (correct) answer you are referring to as it
> proves you wrong.
>
> PointedEars
Here we go again! No it doesn't. There are 2 definitions there and
only the second one refers to the internet. That definition quite
clearly states: "The starting point of the top level domain structure
on the Internet. It is the root, or entry point, to
the .com, .org, .net, etc. domains."
|
|
|
Re: extracting the root domain from a URL [message #171699 is a reply to message #171698] |
Sun, 16 January 2011 14:21 |
Jerry Stuckle
On 1/16/2011 8:11 AM, Captain Paralytic wrote:
> On Jan 14, 11:59 pm, Thomas 'PointedEars' Lahn<PointedE...@web.de>
> wrote:
>> Captain Paralytic wrote:
>>> On Jan 13, 11:26 pm, Mike<mpea...@gmail.com> wrote:
>>>> I thought .com, .asia, etc. was the "TLD". What would you call the
>>>> 'site.com' or 'site.co.uk' portion of the url? Regardless of the
>>>> name, can you suggest an effective and accurate way to extract it?
>>
>>> It is and it is also the root. It may not be the organisation root,
>>> but it is the domain root.
>>
>>> […]
>>> http://www.answers.com/topic/root-domain
>>
>> I suggest you re-read that (correct) answer you are referring to as it
>> proves you wrong.
>>
>> PointedEars
>
> Here we go again! No it doesn't. There are 2 definitions there and
> only the second one refers to the internet. That definition quite
> clearly states: "The starting point of the top level domain structure
> on the Internet. It is the root, or entry point, to
> the .com, .org, .net, etc. domains."
Paul, haven't you learned by now that PE can't read, but is NEVER wrong?
--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex(at)attglobal(dot)net
==================
|
|
|
Re: extracting the root domain from a URL [message #171700 is a reply to message #171696] |
Sun, 16 January 2011 16:57 |
Thomas 'PointedEars'
Jonathan Stein wrote:
> On 15-01-2011 00:54, Thomas 'PointedEars' Lahn wrote:
>> WHOIS would be overkill here and it is not universally supported anymore
>> (for example, DENIC dropped WHOIS support a few years ago except via
>> their website because of misuse), so you would get false positives.
>
> We don't need the actual WHOIS data,
What exactly were you suggesting, then?
> and I believe that even DENIC provides simple WHOIS status information.
I have just found out that they are doing it *again* *now* even using the
proper protocol; now they are only omitting the Admin-C record in the
output. It is a curious development, though; thank you for making me try
again.
>> The proper internet service to use here is DNS itself, of course.
>
> Depending on how Mike defines "root domain", I don't think you can do
> this reliably from DNS.
Yes, you can. That is exactly what DNS is for.
> If you look up www.example.com, and there is no A or AAAA record for
> example.com,
There is here, although not an authoritative one:
$ dig A example.com
; <<>> DiG 9.6-ESV-R3 <<>> A example.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61811
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
;; QUESTION SECTION:
;example.com. IN A
;; ANSWER SECTION:
example.com. 172689 IN A 192.0.32.10
;; Query time: 21 msec
;; SERVER: 212.60.61.246#53(212.60.61.246)
;; WHEN: Sun Jan 16 17:46:42 2011
;; MSG SIZE rcvd: 45
> you'll get www.example.com as the "root domain".
Not here.
> Another approach would be to ask for NS records and define "root domain"
> as the highest level with an NS record. This could however include
> department.example.com, which might not be what Mike intended...
Exactly. It would be best to find out the highest-level domain name with a
name server (which WHOIS can, but need not provide) and ask that name server
about the sub-level domain. Or trust DNS so far as to ask the nearest name
server for a recursive lookup.
PointedEars
--
realism: HTML 4.01 Strict
evangelism: XHTML 1.0 Strict
madness: XHTML 1.1 as application/xhtml+xml
-- Bjoern Hoehrmann
|
|
|
Re: extracting the root domain from a URL [message #171701 is a reply to message #171698] |
Sun, 16 January 2011 17:04 |
Thomas 'PointedEars'
Captain Paralytic wrote:
> Thomas 'PointedEars' Lahn wrote:
>> Captain Paralytic wrote:
>>> On Jan 13, 11:26 pm, Mike <mpea...@gmail.com> wrote:
>>>> I thought .com, .asia, etc. was the "TLD". What would you call the
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> 'site.com' or 'site.co.uk' portion of the url? Regardless of the
>>>> name, can you suggest an effective and accurate way to extract it?
>>
>>> It is and it is also the root. It may not be the organisation root,
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> but it is the domain root.
        ^^^^^^^^^^^^^^^^^^^^^
>>> […]
>>> http://www.answers.com/topic/root-domain
>>
>> I suggest you re-read that (correct) answer you are referring to as it
>> proves you wrong.
>
> Here we go again! No it doesn't.
Yes, it does.
> There are 2 definitions there and only the second one refers to the
> internet. That definition quite clearly states: "The starting point
                                                   ^^^^^^^^^^^^^^^^^^
> of the top level domain structure on the Internet. It is the root, or
> entry point, to the .com, .org, .net, etc. domains."
  ^^^^^^^^^^^^^^^
Yes, the *entry point* *to* those domains; not the domains themselves.
To be precise, the root domain is the trailing dot that is usually omitted,
as in foo.example.com.<-------------------------'
Notice the difference?
PointedEars
|
|
|
Re: extracting the root domain from a URL [message #171705 is a reply to message #171700] |
Sun, 16 January 2011 22:45 |
Jonathan Stein
On 16-01-2011 17:57, Thomas 'PointedEars' Lahn wrote:
>> If you look up www.example.com, and there is no A or AAAA record for
>> example.com,
>
> There is here, although not an authoritative one:
example.com was - as the name implies - just an example.
In general, if there is no A record for the 2nd level, your method would
return the 3rd level (or higher), even for TLDs where the 2nd level is
more likely to be what Mike wants as the "root domain".
- But I think Mike got some inspiration now, and if Google can't help
him the rest of the way, he's welcome back.
Regards
Jonathan
|
|
|