Unlocking critical information

Artificial intelligence, for all its reputed might, is only as good as the data on which it trains. And an alarming amount of the data that businesses generate on a daily basis, up to 80%, is locked inside widely used file formats (including PDF, PPTX, and HTML) that AI models can’t read. As a result, those businesses’ AI models overlook the data-rich content contained within research reports, planning memos, regulatory filings, and many other business-related documents.



Unstructured, a California-based data-processing firm, is out to change that. “The state of most companies’ data is the equivalent of oil that’s just been pumped out of the ground,” says CEO and founder Brian Raymond. “It’s unusable in its raw form.” The company’s revolutionary no-code platform works as a digital refinery, transforming once-inaccessible enterprise data into AI-friendly files.



“We’re the only company out there that can take any unstructured data and get it all into a structured format,” Raymond says. This means that organizations can now use their unstructured data in conjunction with large foundation models, unlocking employees’ ability to chat with and analyze their own company’s data. Success in this singular mission has earned Unstructured a spot on Fast Company’s list of the world’s Most Innovative Companies for 2024.



THE DATA CHALLENGE



“Because it was an incredibly challenging area,” Raymond says, “no other companies over the last decade have tried to solve the problem of automatically transforming any unstructured data into formats where they could be used in conjunction with large language models, such as ChatGPT.” Deciding it was a problem worth solving, Raymond founded Unstructured in 2022 and started building a solution to the first, and most labor-intensive, bottleneck for any machine learning initiative: taking any human-generated data and rapidly transforming it into a language LLMs can understand.



To convert “unstructured” data (the kind of information that does not easily fit into spreadsheets) into AI-friendly file types, the technology would have to break down complex files such as PDFs, HTML pages, PNG images, and reports into processable nuggets. Over the course of 18 months, company engineers annotated millions of documents, integrated more than 500 existing code bases into a single platform, pretrained the company’s own AI, and developed a complex data-file preprocessing framework.



The work resulted in technology that allows users to feed any kind of document, from simple emails and Slack messages to complex tables and technical manuals, into the Unstructured engine for data preprocessing. “You just send the document, and the system recognizes what it is,” Raymond says. “Think of it as a very complex switchboard that will automatically route the document to where the information can be extracted for use in AI applications.”
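
For readers who want a feel for the “switchboard” idea, here is a minimal Python sketch of type-based routing. It is an illustration only and not Unstructured’s code: the function names (`route_document`, `extract_plain_text`, `extract_html`) and the `ROUTES` table are hypothetical, and it assumes the document type can be read off the file extension, which a real system would not rely on.

```python
from html.parser import HTMLParser
from pathlib import Path
from typing import Callable


def extract_plain_text(raw: bytes) -> list[str]:
    """Hypothetical extractor: split plain text into paragraph chunks."""
    text = raw.decode("utf-8", errors="replace")
    return [part.strip() for part in text.split("\n\n") if part.strip()]


class _TextCollector(HTMLParser):
    """Collects the visible text nodes of an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []

    def handle_data(self, data: str) -> None:
        if data.strip():
            self.chunks.append(data.strip())


def extract_html(raw: bytes) -> list[str]:
    """Hypothetical extractor: return the text found between HTML tags."""
    parser = _TextCollector()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()  # flush any text still buffered by the parser
    return parser.chunks


# The "switchboard": maps a recognized type to the extractor that handles it.
# Assumption: the type is inferred from the file extension alone.
ROUTES: dict[str, Callable[[bytes], list[str]]] = {
    ".txt": extract_plain_text,
    ".html": extract_html,
    ".htm": extract_html,
}


def route_document(filename: str, raw: bytes) -> list[str]:
    """Recognize the document type and send it to the matching extractor."""
    suffix = Path(filename).suffix.lower()
    handler = ROUTES.get(suffix)
    if handler is None:
        raise ValueError(f"no extractor registered for {suffix!r}")
    return handler(raw)
```

Calling `route_document("memo.txt", data)` hands the bytes to the plain-text extractor, while a name ending in `.html` goes to the HTML one. Supporting another format in this sketch means adding one entry to `ROUTES`.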



AN OPEN APPROACH



From the start, Unstructured has employed an open-source development process. Data scientists and software developers receive free access to the company’s base technology, which they can use to solve problems in the businesses they work for.



“Nearly 50,000 companies today have used our open-source solution to transform their most important data into AI-ready formats,” Raymond says. “For those looking for a more turnkey option to automatically and continuously transform their data, we’ve built a more high-performance and more feature-complete commercial offering to help them move their AI projects from prototype into production.”