Product Entity Resolution in E-Commerce

Copyright: © 2014 |Pages: 14
DOI: 10.4018/978-1-4666-5198-2.ch016

Abstract

With the rapid development of e-commerce, a huge amount of commodity data is now available on the Internet, and users often spend a lot of time looking for the exact product they want. Identifying product records that represent the same entity is therefore an effective way to make purchasing more efficient. Because e-commerce data frequently contain missing or wrong values and subjective differences in description, traditional entity resolution methods may not perform well on them. This chapter therefore proposes a set of algorithms for data cleaning, attribute and value tagging, and entity resolution that are specialized for e-commerce data. In addition, users' actions are collected to improve the classification result. The chapter evaluates the effectiveness of the proposed algorithms on real-life datasets from e-commerce sites.

Introduction

With the rapid development of the Internet, more and more users choose to purchase products online. However, with the advent of C2C online shops, the number and variety of products have increased rapidly. The same product may be described differently by different sellers, and many product listings suffer from missing information, erroneous information, and other issues. For example, taobao.com has 6 million sellers and almost 0.8 billion products, most of which are entered manually. As a result, product descriptions differ subjectively from seller to seller, and users may take a long time to find the exact product they want.

Therefore, entity classification on e-commerce data has high value. Entity resolution technology is relatively mature, but most techniques target structured or relational data, as in Chapter 5 and Chapter 6, and they cannot achieve the desired results on e-commerce data. Compared with traditional entity identification, entity classification on e-commerce data faces greater challenges:

  1. Incomplete product descriptions. E-commerce data generally include the product name, price, shipping cost, seller, and a description of the product. The product name carries very little information, and the description also lacks detail. Take a camera as an example: the title usually contains the camera model and brand, but this information is embedded in natural-language text, so a computer can hardly tell which attribute each word represents. Most of the description is promotional information and lacks attributes such as the camera's pixel count and lens type.

  2. Inconsistent product descriptions. Different sellers describe the same product differently. For example, some sellers describe the camera Powershot A4000 IS as “Powershot A4000 IS”, while others describe it simply as “A4000”.

  3. Miscellaneous product descriptions. A description is often mixed with information about other products, such as “Similar products that are hot recommended like (Canon) PowerShot A3300 Camera 1600 Megapixel 3 inch LCD promotional price 918.00 Nikon COOLPIX S3300 1600 Megapixel 2.7 inch display new spot promotional price 808.00”, which is in fact a sentence from the description of the product “Sony W630”. Such promotional descriptions of other merchandise cause a lot of interference in entity classification.
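The second and third challenges can be made concrete with a small sketch. It is not the chapter's algorithm: it uses two generic token-set measures (Jaccard similarity and the overlap coefficient), and the function names and the Canon listing title are assumptions introduced here for illustration only. It suggests why plain string similarity struggles: Jaccard gives a low score to two names for the same camera, while the overlap coefficient gives a perfect score to a different product that merely appears in promotional text.

```python
import re

def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Shared tokens divided by all distinct tokens of both strings."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 1.0

def overlap(a, b):
    """Shared tokens divided by the size of the smaller token set."""
    ta, tb = tokens(a), tokens(b)
    smaller = min(len(ta), len(tb))
    return len(ta & tb) / smaller if smaller else 0.0

# Challenge 2: two sellers' names for the same camera.
same_product = jaccard("Powershot A4000 IS", "A4000")

# Challenge 3: promotional text quoted from a Sony W630 description,
# compared with a hypothetical Canon listing title.
promo = ("Similar products that are hot recommended like (Canon) PowerShot "
         "A3300 Camera 1600 Megapixel 3 inch LCD promotional price 918.00")
other_product = overlap(promo, "Canon PowerShot A3300 Camera")

print(f"same product, Jaccard: {same_product:.2f}")
print(f"different product, overlap: {other_product:.2f}")
```

Neither measure separates the two cases on its own, which is consistent with the need for cleaning and attribute tagging before matching.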

We use an example to illustrate these problems. As shown in Figure 1, the model of the first product should be PowerShot A3300 IS, yet half of its title is advertising. The second product is also likely to cause confusion: users cannot determine whether the product model is “G1X”, “PowerShot G1 X”, or “PowerShot G1X”. The third product does not belong to the same class as the first two. The description in Figure 2 contains information about several products bundled in a package, such as “SanDisk 32G SD” and “MINI tripod”.
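One simple way to reduce the model-name ambiguity is to map each spelling to a canonical key. The sketch below is only an illustration under stated assumptions: the `model_key` helper and the `SERIES_WORDS` list are hypothetical and are not taken from the chapter. It drops a known series word and joins the remaining alphanumeric pieces.

```python
import re

# Hypothetical list of series names to ignore when building a key.
SERIES_WORDS = {"powershot"}

def model_key(model):
    """Build a canonical key: lowercase, drop series words, join the rest."""
    words = re.findall(r"[a-z0-9]+", model.lower())
    return "".join(w for w in words if w not in SERIES_WORDS)

variants = ["G1X", "PowerShot G1 X", "PowerShot G1X"]
keys = {model_key(v) for v in variants}
print(keys)
```

All three spellings collapse to a single key here, but the approach is limited: “Powershot A4000 IS” and “A4000” still produce different keys, so abbreviated model names need additional handling.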

Figure 1. Product information for different Canon cameras

Figure 2. The description information returned from the results in Figure 1

As this example shows, e-commerce data differ significantly from structured relational data. Existing entity resolution algorithms for structured data have achieved satisfactory results; the algorithm described in Whang, Menestrina, Koutrika, Theobald, & Garcia-Molina (2009, June) reaches more than 90% accuracy on a 2,000,000-record dataset. However, these algorithms cannot solve the three basic problems of e-commerce data described above, so most of them cannot be applied directly to e-commerce data. To address these problems, we designed and implemented an online system for real-time query processing over massive e-commerce data. The system has the following characteristics:
