Most information exists as unstructured data, ranging from video and photos to text documents. In a business context, text is especially important because so much is captured in it, both internally and in interactions with customers: product descriptions, websites, call reports, Twitter feeds and so on.
A great wealth of knowledge and insight is captured in this text, which can be leveraged to improve business processes. Applying machine learning and other predictive techniques to text offers knowledge-intensive business processes a great way to increase quality and efficiency. However great the potential, text analytics does have caveats to be aware of.
To leverage text, one of the most important things to consider is language. Beyond the natural properties of a language, such as syntax and vocabulary, a number of areas require conscious attention when using text as input data for data science. The three main aspects are:
- Languages: In many businesses one language dominates, but several languages may be used within the company and in interactions with customers. A centralized help desk in Berlin for a travel agency will deal with more than German; call reports will most likely also be in English, Dutch or perhaps Turkish. It is essential to consider this when preparing for analytics. The real world is also much messier than the university lab: a single document might contain multiple languages. For example, an IT product description in the Netherlands will contain many English terms.
- Domains: Language is inherently culturally determined, and this also holds within an organization, where language domains can be very important. An engineering department behaves linguistically differently from the HR department, each using different acronyms, concepts and wording.
- Context: Besides the domains involved, context is just as important. In an emotional situation such as an escalation, an employee will craft an email very differently than when providing feedback on a design just after a project has started. Sometimes emails are brief, sometimes formal, other times funny.
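To make the point about mixed-language documents concrete, here is a minimal sketch that guesses a language per sentence by counting stopword hits. The stopword lists, the function names and the sample text are all hypothetical and far too small for real use; a production setup would rely on a dedicated language-identification library.

```python
import re

# Hypothetical, deliberately tiny stopword lists (assumption, not a real resource)
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "with", "for"},
    "nl": {"de", "het", "een", "en", "van", "is", "met", "voor"},
    "de": {"der", "die", "das", "und", "ist", "mit", "für", "nicht"},
}

def language_scores(text):
    """Count, per language, how many tokens appear in its stopword list."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return {lang: sum(t in words for t in tokens)
            for lang, words in STOPWORDS.items()}

def guess_language(sentence):
    """Return the best-scoring language, or 'unknown' when nothing matches."""
    scores = language_scores(sentence)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def languages_in_document(text):
    """Split on sentence-ending punctuation and collect the guessed languages."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {guess_language(s) for s in sentences}

mixed = ("De server is verbonden met het netwerk van de klant. "
         "The cloud storage is ready for the customer.")
found = languages_in_document(mixed)
```

Working per sentence rather than per document is the design choice that matters here: a single label for the whole text would hide the second language.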
By its nature, text is sloppy and not always well phrased. At Kentivo we consider this key. Most text corpora are based on relatively structured documents, such as documents from the European Parliament. The business world in which text analytics is applied is often far messier.
Natural Language Processing
Important parts of a text relate to the concepts and entities it describes. Sometimes the text is sparse, such as a 140-character Twitter message; other times it is a 120-page document. To be able to leverage a text or a group of texts, there are three key aspects to extracting meaning from the documents:
- Entities: A major part of extracting meaning is determining the concepts and entities used in a text. This requires extracting information such as the topics covered, people mentioned, places, etc.
- Semantics: Besides the entities in a text, insight into the semantics is important to determine the relations and concepts conveyed. Consider for example: "IBM’s OnDemand strategy was a precursor to Cloud Computing." This is quite different from: "IBM’s Cloud Computing strategy was a precursor to OnDemand."
- Structure: Larger texts can contain a lot of structure, with relations between the different parts of the text. An example is a European tender from a ministry to procure a new road. The documents will be highly structured, and that structure conveys meaning about the procedure as well as about how to respond to the tender.
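The semantics example can be illustrated with a small sketch. It assumes simplified versions of the two sentences and a single hand-written pattern (both hypothetical); it shows that a plain word count cannot tell the sentences apart, while even a crude ordered relation can.

```python
import re
from collections import Counter

def bag_of_words(sentence):
    """Word counts only: all information about word order is discarded."""
    return Counter(re.findall(r"\w+", sentence.lower()))

# Hand-written pattern for one relation type (illustrative assumption)
PRECURSOR = re.compile(
    r"^(?P<earlier>.+?) was a precursor to (?P<later>.+?)\.?$"
)

def precursor_relation(sentence):
    """Return an (earlier, relation, later) triple, or None if no match."""
    match = PRECURSOR.match(sentence)
    if match is None:
        return None
    return (match.group("earlier"), "precursor_of", match.group("later"))

a = "OnDemand was a precursor to CloudComputing."
b = "CloudComputing was a precursor to OnDemand."

same_words = bag_of_words(a) == bag_of_words(b)
relation_a = precursor_relation(a)
relation_b = precursor_relation(b)
```

Here `same_words` is true, yet `relation_a` and `relation_b` point in opposite directions. Real systems would use a parser or a trained relation extractor rather than one regular expression.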
Together these aspects give insight into the text and the knowledge present in it. Once they are determined for a text, many different analytical exercises become possible. One can establish a fingerprint of a text and follow trends on specific topics. Another option is to identify which people in an organization possess in-depth knowledge of a certain topic. Alternatively, employee reports can be evaluated to identify possible early triggers for burn-out. The application depends highly on the priorities and objectives of a specific department or organization.
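One possible reading of a text "fingerprint" is a vector of its most frequent terms, which can then be compared between documents. The sketch below takes that reading as an assumption; the stopword list, function names and sample reports are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is",
             "for", "on", "was", "were", "with"}

def fingerprint(text, top_n=10):
    """Return the top_n most frequent non-stopword terms with their counts."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower())
              if t not in STOPWORDS]
    return dict(Counter(tokens).most_common(top_n))

def similarity(fp_a, fp_b):
    """Cosine similarity between two fingerprints (0.0 means no shared terms)."""
    dot = sum(count * fp_b.get(term, 0) for term, count in fp_a.items())
    norm_a = math.sqrt(sum(c * c for c in fp_a.values()))
    norm_b = math.sqrt(sum(c * c for c in fp_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

report_a = ("The cloud migration is delayed. "
            "Cloud costs and cloud security were discussed.")
report_b = "Security of the cloud platform was reviewed with the customer."
report_c = "The road tender procedure starts in March."

fp_a, fp_b, fp_c = (fingerprint(r) for r in (report_a, report_b, report_c))
```

Comparing `similarity(fp_a, fp_b)` with `similarity(fp_a, fp_c)` shows the two cloud-related reports to be closer to each other than to the tender text. Computing such fingerprints per month or per author is one way to approach the trend and expertise questions mentioned above.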
Many analytics techniques and toolboxes exist to get started with text analytics. One can take a black-box approach, leveraging a system such as IBM’s Watson propositions, or a more glass-box approach, leveraging open-source libraries such as NLTK or TensorFlow. How to approach text as a data source for analytics depends strongly on the situation.
If an organisation has a large in-house corpus of text with which to train analytics models, a more glass-box approach might be beneficial. In this situation, companies can create an advantage for themselves. The advantage lies not so much in the technology and the algorithms as in the predictive model and the coefficients determined for the best prediction. These are unique and cannot be reproduced by others, because the data used is the company's own.
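As a rough illustration of why the coefficients, rather than the algorithm, carry the value, the sketch below derives one weight per word from a labelled corpus using smoothed log-odds in the style of naive Bayes. The four-line corpus, the labels (1 for an escalation, 0 for routine) and all function names are hypothetical stand-ins for an organisation's own data.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train_log_odds(labelled_texts, smoothing=1.0):
    """Return one coefficient per word; positive values point to label 1."""
    counts = {0: Counter(), 1: Counter()}
    for text, label in labelled_texts:
        counts[label].update(tokenize(text))
    vocabulary = set(counts[0]) | set(counts[1])
    # Smoothed token totals per label, so unseen words never give log(0)
    totals = {label: sum(c.values()) + smoothing * len(vocabulary)
              for label, c in counts.items()}
    return {
        word: math.log((counts[1][word] + smoothing) / totals[1])
        - math.log((counts[0][word] + smoothing) / totals[0])
        for word in vocabulary
    }

def score(text, coefficients):
    """Sum the coefficients of known words; unknown words contribute 0."""
    return sum(coefficients.get(t, 0.0) for t in tokenize(text))

# Hypothetical in-house call reports: 1 = escalation, 0 = routine
CORPUS = [
    ("customer is angry and demands an urgent callback", 1),
    ("urgent complaint about a cancelled booking", 1),
    ("customer asked for an invoice copy", 0),
    ("booking confirmed and customer said thanks", 0),
]
coefficients = train_log_odds(CORPUS)
```

The algorithm is a few lines that anyone can copy; the resulting `coefficients` dictionary depends entirely on the corpus it was trained on. Because the weights are plain numbers per word, they can also be inspected directly, which is the practical meaning of glass-box here.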
At Kentivo, we have a strong background in using text for analytics purposes. As early as 2007, some of our senior consultants were involved in innovative text analytics solutions. The solutions ranged from entity extraction, call log analytics and HR/expert identification to alerting systems. The objective was always to create an edge in a core process of an organisation.
Kentivo's experience covers the full range, from developing a business case, via a prototype, to implementing and maintaining a production system.