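I know there are countless threads asking this question, but I have not been able to find one that helps me solve it.
I am essentially trying to parse a list of about 10,000,000 URLs, make sure each one is valid according to the criteria below, and then get the root domain. The list contains just about everything you can imagine, including entries like these (shown with the expected result for each):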
    bit.ly/test                                    [VALID]    [return - bit.ly]
    example.com/apples?test=1&id=4                 [VALID]    [return - example.com]
    host101.wow404.apples.test.com/cert/blah       [VALID]    [return - test.com]
    101.121.44.xxx                                 [INVALID]  [return false]
    localhost/noway                                [INVALID]  [return false]
    www.awesome.com                                [VALID]    [return - awesome.com]
    i am so awesome                                [INVALID]  [return false]
    http://404.mynewsite.com/visits/page/view/1/   [VALID]    [return - mynewsite.com]
    www1.151.com/searchresults                     [VALID]    [return - 151.com]
Does anyone have any suggestions for this?
^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)
Explanation
    ^                     # start-of-line
    (?:                   # begin non-capturing group
      https?              #   "http" or "https"
      ://                 #   "://"
    )?                    # end non-capturing group, make optional
    (?:                   # start non-capturing group
      [a-z0-9-]+\.        #   a name part (numbers, ASCII letters, dashes) & a dot
    )*                    # end non-capturing group, match as often as possible
    (                     # begin group 1 (this will be the domain name)
      (?:                 #   start non-capturing group
        [a-z0-9-]+\.      #     a name part, same as above
      )                   #   end non-capturing group
      [a-z]+              #   the TLD
    )                     # end group 1
http://rubular.com/r/g6s9bQpNnC
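To try the pattern against the sample list, here is a minimal Python sketch. The function name `root_domain`, the use of `re.IGNORECASE`, and the lowercasing of the result are my own assumptions and not part of the pattern above; the regex itself is copied unchanged.

```python
import re

# The pattern from the answer, unchanged. IGNORECASE is an assumption added
# here so that mixed-case input such as "WWW.Awesome.com" is also accepted.
ROOT_DOMAIN_RE = re.compile(
    r"^(?:https?://)?(?:[a-z0-9-]+\.)*((?:[a-z0-9-]+\.)[a-z]+)",
    re.IGNORECASE,
)


def root_domain(url):
    """Return group 1 (the last two host labels), or None if there is no match.

    `root_domain` is a hypothetical helper name used only for illustration.
    """
    match = ROOT_DOMAIN_RE.match(url.strip())
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    samples = [
        "bit.ly/test",
        "example.com/apples?test=1&id=4",
        "host101.wow404.apples.test.com/cert/blah",
        "localhost/noway",
        "www.awesome.com",
        "i am so awesome",
        "http://404.mynewsite.com/visits/page/view/1/",
        "www1.151.com/searchresults",
    ]
    for sample in samples:
        print(sample, "->", root_domain(sample))
```

One caveat worth checking against your own data: the pattern only looks at the shape of the string, so `101.121.44.xxx` is not rejected by it. The regex matches and captures `44.xxx`, which means that case would need a separate check, for example against a list of known TLDs.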