On July 14, 2017, Baidu open-sourced a topic model project named Familia.
An InfoQ reporter contacted Jiang Di, the leader of the Familia project at Baidu, for an interview; in this article, he walks us through the technical details of the Familia project.
What is Familia?
The Familia open source project includes a document topic inference tool, a semantic matching computation tool, and three topic models trained on industrial-scale corpora: Latent Dirichlet Allocation (LDA), SentenceLDA, and Topical Word Embedding (TWE).
Familia supports using these models "off the shelf" in research and applications such as text classification, text clustering, and personalized recommendation. Given the high cost of training topic models and the limited open-source topic model resources currently available, the team will continue to release topic models trained on industrial-scale corpora for a number of vertical domains, along with typical industrial applications of these models, to help topic model technology in both research and real-world deployment.
According to Jiang Di, the Familia topic model project is an important part of Baidu's research and development of Bayesian network technology, and has already supported a number of Baidu products, including search, the information feed, and Tieba.
Topic models face a number of common problems in industrial applications. In view of these problems, the Familia project has the following objectives:
First, to collect and organize topic models of relatively high value to industry, so that developers can use them directly;
Second, to abstract the paradigms for applying topic models in industry, in the hope of providing guidance for effectively "landing" topic models in production.
In general, topic models can be seen as a subclass of Bayesian networks: they extract sets of semantically related words from text. There is also some academic work on topic models that does not fall under Bayesian networks, but such work is relatively rare.
Jiang Di said: "In our application experience, topic models are well suited to unsupervised semantic analysis tasks. From practical engineering cases, we abstract the application paradigms of topic models into two categories: semantic representation and semantic matching. Both paradigms have many successful industrial applications, such as news classification and personalized recommendation; details can be found on our open-source project's wiki page (https://github.com/baidu/Familia/wiki). Beyond these more traditional paradigms, we are also exploring applications of topic model technology in new scenarios such as chatbots."
The training cost of topic models
Obtaining a high-quality topic model involves many steps: data pre-cleaning, graphical model design, parameter estimation algorithm design, and model post-processing. At present, most research work focuses on graphical model design and parameter estimation algorithms, while the other steps receive relatively little discussion and data.
In practice, the project team found that those other steps are also critical and often determine whether topic model technology can successfully land in industry. Billion-scale corpora, large-scale data processing, and parallelized high-performance computing are resources that are costly for most developers to obtain and implement.
So they made a decision: "We open the high-quality models obtained from training directly to the community, so that developers can use them off the shelf to support their applications. We will also open-source the code and documentation for each step of building a high-quality topic model."
How topic models are evaluated
To assess the effect of a topic model, academia often uses quantitative metrics such as perplexity and likelihood. These metrics are widely used but also widely criticized: many researchers question whether they can effectively measure the quality of a topic model. In practical industrial applications, the project team focuses more on two indicators: first, the interpretability of the model's topics, which matters for iterative system development and effect analysis; second, product-related metrics after the model is applied.
There are already many excellent topic model tools in industry, many of them open-sourced on GitHub. But Jiang Di said that, relative to these tools, Familia focuses more on model diversity and support for industrial deployment. Baidu hopes to give developers more choices beyond LDA. Moreover, the team cares more about the practicality of the models, which is why they abstract the application paradigms of topic models, hoping to further advance topic model technology at the application level.
The Familia project documentation describes an already-released news-domain topic model and a web-page topic model that is about to be released. In the interview, Jiang Di explained how domains are divided: the main purpose of the Familia project is to serve products, so the open domains are divided by product. For example, the news topic model can support information feed products, while the web-page topic model can support search engines; thanks to the scale and diversity of its corpus, the web-page topic model has broader applicability.
The Familia team also said that models for more vertical domains will be released later, with domains and priorities adjusted based on developer feedback, in the hope of covering a more comprehensive range of product categories.
The topic distribution generated by a topic model can be seen as a semantic representation of the document and can be used for tasks such as document classification, clustering, content richness analysis, and CTR prediction. Document feature representations based on topic models fall into two categories, as shown in Figure 1: the first uses the topic model for dimensionality reduction, representing each document as a distribution over topics; models such as LDA and SentenceLDA support this type of representation. The second combines topic vectors with the document's topic distribution to generate a document vector representation; models that integrate topics with word vectors, such as TWE, support this type.
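As a minimal illustration of the first representation type, the sketch below turns a document into a distribution over topics using a simple fold-in heuristic (averaging P(z|w) over the document's words). The vocabulary, topic-word matrix, and prior are all made-up toy values, not Familia's actual model or inference procedure:

```python
import numpy as np

# Hypothetical topic-word matrix P(w|z): 3 topics over a 5-word vocabulary.
# In practice these parameters come from a trained model such as Familia's LDA.
vocab = ["stock", "market", "film", "director", "price"]
topic_word = np.array([
    [0.40, 0.30, 0.01, 0.01, 0.28],   # topic 0: finance
    [0.02, 0.03, 0.50, 0.40, 0.05],   # topic 1: movies
    [0.20, 0.20, 0.20, 0.20, 0.20],   # topic 2: background
])
topic_prior = np.array([0.4, 0.4, 0.2])  # P(z)

def doc_topic_distribution(words):
    """Fold-in heuristic: average the word posteriors P(z|w) over the document."""
    idx = [vocab.index(w) for w in words if w in vocab]
    # P(z|w) is proportional to P(w|z) * P(z); normalize each word's column.
    post = topic_word[:, idx] * topic_prior[:, None]
    post /= post.sum(axis=0, keepdims=True)
    return post.mean(axis=1)

theta = doc_topic_distribution(["stock", "price", "market"])
print(theta)  # the finance topic should dominate
```

The resulting low-dimensional vector `theta` is the kind of semantic feature the text describes feeding into classifiers or clustering.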
Case: news quality classification
For a news app, news comes from many sources and its quality varies widely. Table 2 lists some example titles of low-quality and high-quality news.
To improve the user experience, a classifier is usually built to automatically filter out low-quality news. Some traditional features can be designed by hand: the news source site, the content length, the number of images, and so on. In addition to these handcrafted features, a topic model can be used to compute the topic distribution of each news item, which serves as additional features that are combined with the handcrafted ones to form a new feature set (Figure 2(a)).
7,000 news items were manually labeled, with quality divided into three grades, where grade 0 is the worst and grade 2 the best. A Gradient Boosting Decision Tree (GBDT) classifier was trained on 5,000 of the news items, once with the handcrafted features alone and once with the topic-augmented feature set, and tested on the other 2,000 items. Figure 2(b) shows the classification accuracy on the test data for the different feature sets. The results show that adding topic distributions as extra features effectively improves the classifier.
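The setup above can be sketched with synthetic data: handcrafted features plus a per-item topic distribution feed a GBDT classifier. The feature values, label rule, and sizes below are invented for illustration and stand in for the labeled news data described in the text (scikit-learn's GradientBoostingClassifier is used here; this is not Familia's own pipeline):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 1000
handcrafted = rng.normal(size=(n, 3))             # e.g. source score, length, image count
topic_feats = rng.dirichlet(np.ones(10), size=n)  # per-item topic distribution

# Synthetic 3-grade quality label correlated with both feature groups.
score = handcrafted[:, 0] + 3 * topic_feats[:, 0]
y = np.digitize(score, np.quantile(score, [1 / 3, 2 / 3]))

# Concatenate handcrafted and topic features into the augmented feature set.
X = np.hstack([handcrafted, topic_feats])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = GradientBoostingClassifier().fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(f"accuracy with topic-augmented features: {acc:.2f}")
```

Dropping the `topic_feats` columns from `X` and retraining reproduces the comparison the text describes.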
Case: news clustering
The topic distribution of a document can be seen as a dimensionality-reduced representation that preserves semantic information and can be used to cluster documents. Table 3 shows partial results of K-Means clustering based on topic distributions. As the table shows, clustering on news topic distributions works well: cluster 1 gathers news related to home renovation, and cluster 2 gathers stock-related news.
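A toy version of this pipeline, with made-up 4-topic distributions standing in for real news documents, shows K-Means recovering two groups of documents from their topic distributions alone:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Hypothetical 4-topic distributions: the first 20 "documents" peak on topic 0
# (say, home renovation), the next 20 on topic 2 (say, stocks).
docs = np.vstack([
    rng.dirichlet([8, 1, 1, 1], size=20),
    rng.dirichlet([1, 1, 8, 1], size=20),
])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(docs)
print(labels)  # the two halves should fall into different clusters
```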
Case: web content richness
Some information retrieval tasks need to measure the richness of a web page's content. By computing the page's topic distribution, the information entropy of that distribution can be used as an indicator of content richness: the larger the entropy, the richer the page's content. Content richness can then be introduced as a one-dimensional feature into more complex page ranking functions.
Many industrial applications need to measure the semantic similarity of two texts; this kind of demand is called "semantic matching". Depending on text length, semantic matching can be divided into three categories: short text - short text, short text - long text, and long text - long text semantic matching.
Semantic matching based on topic models usually serves as a complement to classical text matching techniques rather than a replacement for them.
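The entropy computation itself is simple; a small sketch (with hypothetical topic distributions) shows that a page spread evenly across topics scores higher than one dominated by a single topic:

```python
import numpy as np

def topic_entropy(theta):
    """Shannon entropy of a page's topic distribution (higher = richer content)."""
    theta = np.asarray(theta, dtype=float)
    theta = theta[theta > 0]  # treat 0 * log(0) as 0
    return float(-(theta * np.log(theta)).sum())

focused_page = [0.90, 0.05, 0.03, 0.02]  # dominated by one topic
diverse_page = [0.25, 0.25, 0.25, 0.25]  # spread evenly across topics
print(topic_entropy(focused_page), topic_entropy(diverse_page))
```

The uniform distribution attains the maximum entropy, ln(4) ≈ 1.386 for four topics.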
Short text - short text semantic matching
Short text - short text semantic matching has a very wide range of industrial application scenarios. For example, in search engines, the similarity between a user query and a web page title must be measured; in query recommendation, the similarity between one query and other queries must be measured. Both scenarios use short text - short text semantic matching. Since topic models do not perform well on short text, word vectors are applied more commonly than topic models in this task. For simple tasks, word vectors trained with a shallow neural network model such as Word2Vec can be used.
For example, the query recommendation task often needs to compute the similarity of two queries, such as query1 = "recommend good movies" and query2 = "good movies of 2016". After obtaining a vector representation of each query by element-wise addition of its word vectors, the similarity of the two can be computed with cosine similarity.
For harder short text - short text matching tasks, supervised signals can be introduced, using more complex neural network models such as the Deep Structured Semantic Model (DSSM) or the Convolutional Latent Semantic Model (CLSM) to estimate semantic similarity.
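A minimal sketch of this bag-of-vectors approach, with tiny made-up 3-dimensional word vectors standing in for a trained Word2Vec model:

```python
import numpy as np

# Toy word vectors; a real system would load these from a trained Word2Vec model.
word_vec = {
    "recommend": np.array([0.1, 0.9, 0.2]),
    "good":      np.array([0.8, 0.3, 0.1]),
    "movie":     np.array([0.2, 0.1, 0.9]),
    "2016":      np.array([0.5, 0.5, 0.5]),
}

def query_vector(words):
    """Element-wise sum of the word vectors of the query's words."""
    return np.sum([word_vec[w] for w in words if w in word_vec], axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q1 = query_vector(["recommend", "good", "movie"])
q2 = query_vector(["2016", "good", "movie"])
print(f"similarity: {cosine(q1, q2):.3f}")
```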
Short text - long text semantic matching
Short text - long text semantic matching is also very common in industry. For example, search engines need to compute the similarity between a user query and a web page's content. Since the query is short text and the content is long text, short text - long text semantic matching is used. When computing the similarity, one should avoid inferring a topic distribution for the short text directly; instead, the probability that the short text is generated from the long text's topic distribution is computed as their similarity:

P(q|c) = ∏_{w∈q} ∑_k P(w|z_k) · P(z_k|c)    (1)

where q denotes the query, c the content, w a word, and z_k the k-th topic.
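A small sketch of this generation probability, with a hypothetical vocabulary and made-up topic parameters (not Familia's trained values): a query about wedding photos scores much higher against a page whose topic distribution leans toward the wedding topic than against a finance-heavy page.

```python
import numpy as np

# Hypothetical trained model: P(w|z) for 2 topics over a tiny vocabulary.
vocab = {"wedding": 0, "photo": 1, "loan": 2, "stock": 3}
word_given_topic = np.array([
    [0.45, 0.45, 0.05, 0.05],   # topic 0: wedding photography
    [0.05, 0.05, 0.45, 0.45],   # topic 1: finance
])

def query_likelihood(query_words, topic_given_content):
    """P(q|c) = product over query words of sum_k P(w|z_k) * P(z_k|c)."""
    p = 1.0
    for w in query_words:
        col = vocab[w]
        p *= float(word_given_topic[:, col] @ topic_given_content)
    return p

wedding_page = np.array([0.9, 0.1])   # long text's inferred P(z|c), mostly weddings
finance_page = np.array([0.1, 0.9])
q = ["wedding", "photo"]
print(query_likelihood(q, wedding_page), query_likelihood(q, finance_page))
```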
Case: user query - ad page similarity
In the online advertising scenario, the semantic similarity between a user query and an ad page must be computed. SentenceLDA can be applied here, with the text in each field of the ad page treated as a sentence, as shown in Figure 3 (red boxes mark the sentences). First, the topic model learns the ad page's topic distribution; then formula (1) is used to compute the semantic similarity between the user query and the ad page. This similarity can be applied as a one-dimensional feature in a more complex ranking model. In Figure 4, for the query "wedding photography", the R&D team compared the results of different feature combinations. On the left is the baseline; on the right are the results after introducing the SentenceLDA-based similarity between the query and the ad page. Compared with the baseline, the recall results with the new feature better match the query semantics and better meet users' needs.
Long text - long text semantic matching
With a topic model, the topic distributions of two long texts can be obtained, and the distance between the two multinomial distributions can then be computed as a similarity measure. Hellinger distance and Jensen-Shannon divergence (JSD) are commonly used to measure the distance between such distributions.
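Both distances are easy to implement directly; a minimal sketch over hypothetical 3-topic distributions (doc_a and doc_b share a similar topic mix, doc_c does not):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))

def jsd(p, q):
    """Jensen-Shannon divergence (natural log base)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float((a[mask] * np.log(a[mask] / b[mask])).sum())
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]   # similar topic mix to doc_a
doc_c = [0.1, 0.2, 0.7]   # very different topic mix
print(hellinger(doc_a, doc_b), hellinger(doc_a, doc_c))
print(jsd(doc_a, doc_b), jsd(doc_a, doc_c))
```

Smaller distances indicate more semantically similar long texts under either measure.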
Case: news personalized recommendation
Long text - long text semantic matching can be used in personalized recommendation tasks. In Internet applications, once a large amount of user behavior information has accumulated, the content associated with that behavior can be combined into an abstract "document", and mapping this document onto a topic distribution yields a user profile. For example, in personalized news recommendation, the news items (or news headlines) a user has recently read can be combined into one long "document", and that document's topic distribution can serve as a user profile of the user's reading interests. As shown in Figure 5, by computing the Hellinger distance between each incoming news item's topic distribution and the user profile, the system can decide which news to push to the user, achieving personalized news recommendation.
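The recommendation step can be sketched as ranking candidate news by Hellinger distance to the user profile; the profile and candidate topic distributions below are invented for illustration:

```python
import numpy as np

def hellinger(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sqrt(0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2).sum()))

# Hypothetical user profile: topic distribution of the concatenated recently-read
# news (topic 0 = sports, topic 1 = tech, topic 2 = finance).
user_profile = np.array([0.6, 0.3, 0.1])

candidate_news = {
    "match highlights":   np.array([0.8, 0.1, 0.1]),
    "new phone review":   np.array([0.1, 0.8, 0.1]),
    "stock market close": np.array([0.1, 0.1, 0.8]),
}

# Rank candidates by Hellinger distance to the profile (smaller = better match).
ranked = sorted(candidate_news, key=lambda t: hellinger(user_profile, candidate_news[t]))
print(ranked)
```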
For more cases, visit: https://github.com/baidu/Familia/wiki
With Familia open-sourced, other companies and institutions can use the project's models and code and build on them. Jiang Di said: "Small and medium-sized enterprises can directly use our models and code to support products involving semantic representation and semantic matching, without investing a lot of manpower in this research. Specific application cases can be found on our GitHub wiki page (https://github.com/baidu/Familia/wiki); we hope these cases can play a guiding role." He added that the open-sourced topic models themselves can also serve as datasets for academic research, for example helping researchers study topics such as automatic topic labeling, topic model compression, and topic weighting.
The Familia team hopes that developers and researchers will apply Familia in different scenarios, dig out more potential applications of topic models, and raise more requirements and needs. Baidu is willing to engage in extensive and in-depth exchanges with the community, further promoting the development and application innovation of topic model technology.