Web Search versus Portal Search
Until now, Web-search technologies have had little impact on property search. Property search has remained the preserve of portal infrastructures, most notably RightMove in the UK. This is because Web search, if it is to be really useful for property search, requires “deep crawling” and “high-level entity parsing”. Furthermore, it requires both of these for each unique estate-agency Website.
In this blog post we introduce a novel Web-search approach. This approach employs machine-learning techniques to address the problems of Website scaling. In so doing, it circumvents the data difficulties experienced by RightMove’s smaller portal rivals. It also solves a staleness problem which has been highlighted in the press.
Property Search Portals
RightMove is the largest property search site in the UK. It has three traditional rivals: PropertyFinder, PrimeLocation and FindaProperty. All four sites are portal sites, receiving property listings from estate agents by way of datafeeds. Estate agencies pay these portals a monthly subscription (per branch).
Until now, RightMove has enjoyed a first-mover advantage over its rivals: it currently has at least three times as many property listings and at least 14 times as many page views. It has also been argued that RightMove’s advantageous position will strengthen over time. According to Ed Williams, managing director of RightMove, the future is rosy: “I would contest the view that anyone is coming up behind us, I’d argue that we are pulling away.” This is because a great many estate agencies have been unwilling to provide data to multiple portals, all with comparable interfaces. Rather, they back only the clear leader, and that is RightMove.
There are three new “mashup” portal sites in the UK: OnOneMap, Nestoria and PrimeMove. Finding it difficult to get datafeeds from estate agents, these sites have resorted to getting datafeeds from some of the traditional portals (and from some of the smaller portals too). In so doing, they have become portals of portal data. The three new portals present their data to consumers using Google Maps technology.
Web search applied to property search
What about a Googlish Web-search approach to getting property data?
Two difficulties—greatly exacerbated by scale
Google doesn’t index Web data by having Website owners send structured datafeeds. Rather, it retrieves data by “crawling” the Web and then “parsing” out the meaningful content from the crawled Webpages. Until now, this data-retrieval approach hasn’t impacted on property search because of two key difficulties:
(1) Whilst it is easy for general-purpose search engines like Google to find estate-agency Websites, these search engines have not been able to index the property listings within these sites. We say that estate-agency listings are hidden within the “deep” Web: behind search forms and within dynamically generated Webpages. By accident or design, sites like Halifax are painfully difficult to navigate. What is worse, each of these sites is uniquely difficult, because each agency site has a different structure.
(2) Suppose you type into the search box of a property search engine: “flat, Mayfair, for sale, under £2m.” You don’t want to be given a Webpage featuring a description of the “Flat” Earth Society, an advertisement for “Mayfair” cigarettes, a fold-up bicycle “for sale”, and the latest footballer transfer to Scunthorpe United for “£2m”. Nor do you want a Webpage featuring a flat in Notting Hill for sale alongside a house in Mayfair which is for rent and which has just had a £2m refurbishment. The simple text-matching approach used by generalist search engines like Google and Yahoo is not appropriate. You don’t want Webpages which feature particular words. Instead, you want to be directed to properties which have all the specified attributes: location, price, dimensions, lease period, etc. (sometimes these attributes are not all on the same Webpage). If a property has a list price of £1.8m, you want the engine to recognise that £1.8m is less than £2m. You probably also want the engine to recognise conceptual relatedness, for example that a “balcony” is a type of “outside space”. Basic attributes like location and price do not present a problem for portals, because the datafeeds they receive comprise “typed” data. But developing an engine which finds particular types of information on the Web, particular attributes, is not easily done, and certainly not if we expect the engine to scale to many, many Website formats.
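To make the contrast with text matching concrete, here is a minimal Python sketch of attribute-based matching. It is not Extate's implementation: the `Listing` and `Query` classes, the field names and the tiny `BROADER` concept table are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical concept table: each feature maps to a broader concept.
BROADER = {
    "balcony": "outside space",
    "garden": "outside space",
    "roof terrace": "outside space",
}


@dataclass
class Listing:
    kind: str      # e.g. "flat", "house"
    area: str      # e.g. "Mayfair"
    tenure: str    # "for sale" or "to rent"
    price: int     # list price in pounds
    features: set = field(default_factory=set)


@dataclass
class Query:
    kind: str
    area: str
    tenure: str
    max_price: int
    required_features: set = field(default_factory=set)


def has_feature(listing, wanted):
    """True if the listing has the feature itself or a narrower one."""
    return any(f == wanted or BROADER.get(f) == wanted for f in listing.features)


def matches(listing, query):
    """Every attribute must match; the price is compared as a number."""
    return (
        listing.kind == query.kind
        and listing.area == query.area
        and listing.tenure == query.tenure
        and listing.price < query.max_price
        and all(has_feature(listing, w) for w in query.required_features)
    )
```

With this in place, a £1.8m Mayfair flat with a balcony satisfies a query for a Mayfair flat for sale under £2m with outside space, whereas a Notting Hill flat or a Mayfair house to rent does not, however many of the query words appear on its Webpage.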
An “artificially intelligent” solution
BytePlay Limited is a search technology company born out of the research labs at Imperial College London. The company’s focus has been on finding high-level entities in the deep Web. High-level entities include properties, jobs, cars and people. These entities have attributes. For example, properties have addresses, jobs have associated salaries, cars have registration years, people have company affiliations. Some of these attributes are quantitative, some are low-level entities (like a particular geographical location) and some are type attributes (like a particular property’s being a freehold property).
Extate is an application of the BytePlay technology within the context of property search. The Extate search engine comprises crawlers that can dig deep into estate-agency sites. The crawlers are able to deal appropriately with search forms, URLs that change when content doesn’t, and URLs that remain unchanged whilst the associated content does change.
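One common way to cope with both URL problems is to identify a page by a fingerprint of its content rather than by its URL. The Python sketch below illustrates that general idea only; it is an assumption about how such a crawler might work, not a description of Extate's crawlers, and the `fingerprint` function and `SeenPages` class are hypothetical names.

```python
import hashlib
import re


def fingerprint(html):
    """Hash the visible text, ignoring tags, letter case and whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    text = " ".join(text.split()).lower()
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class SeenPages:
    """Remembers which content has been seen, and under which URL."""

    def __init__(self):
        self.by_url = {}    # url -> fingerprint of the last fetch
        self.known = set()  # every fingerprint seen so far

    def observe(self, url, html):
        fp = fingerprint(html)
        if self.by_url.get(url) == fp:
            status = "unchanged"   # same URL, same content
        elif fp in self.known:
            status = "duplicate"   # content already seen under another URL
        elif url in self.by_url:
            status = "updated"     # same URL, new content
        else:
            status = "new"
        self.by_url[url] = fp
        self.known.add(fp)
        return status
```

A session-stamped URL that returns an already-seen listing is reported as "duplicate", while a fixed results URL whose content has changed is reported as "updated".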
In addition, and most crucially, the Extate engine comprises a machine-learning mechanism for automatically generating custom parsers for each Website which is crawled. Traditional techniques like text-classification (as used by engines like Google to filter out spam) and XPath discernment may be used to construct a Website-specific parser. But, given the number of Websites that need to be parsed, manual construction of Website-specific parsers is not an option. The machine-learning mechanism, which is at the heart of the Extate engine, takes in a set of crawled sample Webpages from a Website. From these it “teaches” itself how to customise the generic parser framework to the particular Website. This automatic parser generation opens the door to entirely new scaling possibilities.
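As a toy illustration of learning a site-specific parser from sample pages, the Python sketch below looks for the tag path under which a price-like string appears most consistently across the samples, then uses that path to extract prices from further pages of the same site. This is a deliberately simplified stand-in, not the Extate mechanism: the `PRICE` pattern, the tag-path notation and the sample pages are all assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser

PRICE = re.compile(r"£\s?[\d,]+")                       # assumed price format
VOID = {"br", "img", "hr", "input", "meta", "link"}     # tags with no end tag


class PathCollector(HTMLParser):
    """Records every piece of text together with the tag path leading to it."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.hits = []  # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        cls = dict(attrs).get("class")
        self.stack.append(f"{tag}.{cls}" if cls else tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split(".")[0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.hits.append(("/".join(self.stack), text))


def text_by_path(html):
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.hits


def learn_price_path(sample_pages):
    """Pick the path that holds a bare price on the most sample pages."""
    votes = Counter()
    for html in sample_pages:
        votes.update({p for p, text in text_by_path(html) if PRICE.fullmatch(text)})
    return votes.most_common(1)[0][0] if votes else None


def extract_price(html, path):
    """Apply the learned path to a page; return the price in pounds or None."""
    for p, text in text_by_path(html):
        if p == path and PRICE.fullmatch(text):
            return int(re.sub(r"[^\d]", "", text))
    return None


# Invented sample pages from one imaginary agency site.
SAMPLES = [
    '<html><body><h1>Flat in Mayfair</h1><p class="blurb">Refurbished for '
    '£2,000,000</p><span class="price">£1,800,000</span></body></html>',
    '<html><body><h1>House in Notting Hill</h1><span class="price">£950,000'
    '</span><p class="blurb">Near the park</p></body></html>',
]
```

Here `learn_price_path(SAMPLES)` settles on the `span` with class `price`, and the refurbishment cost mentioned in the blurb is ignored because it is not a bare price at that path. A real system would have to learn many attributes at once and cope with far messier markup.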
Currently, Extate.co.uk indexes the sites of 394 UK estate agencies, yielding some 150,000 properties. It was the first property search site to feature listings from all ten of the UK’s largest estate-agency groups.
On March 30th, 2007, Extate.co.za went live, featuring listings from South African estate agencies.
Data freshness
Jessie Hewitson, in her Daily Mail article, ’Log on to move out,’ writes: “The main problem with portals, however, is that they rely on estate agents to provide accurate and up-to-date information - i.e. to report properties that have been sold or under offer. For those using portals, nothing is more irritating than finding the perfect home online only to be told by the agent that it was sold the previous week or even month. ’This is a problem, and it is impossible to police every agent that list their properties with us,’ says Springett [CEO of PrimeLocation]”
This portal staleness problem is surprisingly significant. On 24 December 2006, the BytePlay team compared a sample of 112,027 RightMove listings with the respective estate-agency Websites. Approximately 18% of the listings were inappropriately classified: they were shown as available when in fact under offer or sold or, to a lesser extent, as under offer when in fact sold. For comparison, approximately 26% of the properties were listed as sold or under offer on the estate-agency sites.
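The comparison behind a figure like this can be sketched in a few lines of Python. The sketch assumes that each property's status on the portal has already been paired with its status on the agency's own site; the status names, the `RANK` ordering and the function names are illustrative assumptions, not the BytePlay team's actual code.

```python
# A property moves forward through these states; a portal is stale when
# it shows an earlier state than the agency's own Website does.
RANK = {"available": 0, "under offer": 1, "sold": 2}


def is_stale(portal_status, agency_status):
    return RANK[portal_status] < RANK[agency_status]


def staleness_rate(pairs):
    """Fraction of (portal_status, agency_status) pairs that are stale."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_stale(p, a) for p, a in pairs) / len(pairs)
```

For example, if one listing in four is shown as available on the portal but as sold on the agency site, `staleness_rate` returns 0.25.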
Provided crawl-parse cycles happen frequently, this staleness problem doesn’t affect Web search. If a listing changes on an estate-agency Website, the change is picked up on the next cycle and propagated to the search index.
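Propagation of this kind amounts to comparing the result of one crawl-parse cycle with the result of the previous one. The Python sketch below assumes each cycle produces a dictionary mapping a property identifier to its status; the identifiers and the "withdrawn" label for listings that have disappeared are assumptions for illustration.

```python
def status_changes(previous, current):
    """Return {property_id: (old_status, new_status)} for everything that changed."""
    changes = {}
    for pid, status in current.items():
        old = previous.get(pid)  # None means the listing is newly seen
        if old != status:
            changes[pid] = (old, status)
    # Listings present last time but missing now are treated as withdrawn.
    for pid in previous.keys() - current.keys():
        changes[pid] = (previous[pid], "withdrawn")
    return changes
```

Only the entries returned by `status_changes` need to be written to the index, so the more often the cycle runs, the shorter the time for which a sold property can still appear as available.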