
If by the raw data you mean the actual filings with the SEC, then the raw data is available via FTP from the SEC itself.

http://sec.gov/edgar/searchedgar/ftpusers.htm

Parsing the EDGAR documents is a mixed bag. Many of the older filings and some of the more recent ones are in plain text rather than HTML. Finding footnotes in the HTML is probably not that bad, but the issue is incomplete coverage: you miss an HTML footnote, or the document turns out to be plain text instead.
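As a rough illustration of the text-versus-HTML split, here is a minimal sketch of a check you could run before picking a parser. The function name and the tag list are my own assumptions, not anything EDGAR defines. I leave `<table>` out of the hints on purpose, on the assumption that some plain-text filings wrap their fixed-width tables in SGML-style `<TABLE>` tags.

```python
import re

# Hypothetical heuristic: tags that suggest real HTML markup.
# <table> is excluded on the assumption that plain-text filings
# may also use SGML-style <TABLE> wrappers around fixed-width text.
_HTML_HINT = re.compile(r"<\s*(html|body|td|tr|div|font)\b", re.IGNORECASE)


def looks_like_html(document: str) -> bool:
    """Return True if the filing body appears to contain HTML markup."""
    return bool(_HTML_HINT.search(document))
```

Anything that fails the check would go to a separate plain-text path, which is where most of the coverage gaps tend to come from.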

I've worked on parsing the HTML tables for statements like the Balance Sheet, Cash Flow, etc. It was problematic and I only got about 70% of the way there, but I think a more complex rule base could get to 90%. The issue is that 90% isn't really good enough for many users.
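To give a flavor of what the table parsing involves, here is a minimal sketch using only Python's standard `html.parser`. It is not the rule base I actually used; the class and function names are made up for illustration, and it ignores nested tables, colspans, and multi-line labels, which is exactly where the hard cases live.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace (including non-breaking spaces).
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if any(self._row):  # skip rows with only empty cells
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    """Return every non-empty table row in the document as a list of strings."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def parse_amount(text):
    """Convert '$ 1,234' or '(56)' to a float; parentheses mean negative."""
    cleaned = text.replace("$", "").replace(",", "").strip()
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    try:
        value = float(cleaned.strip("()"))
    except ValueError:
        return None  # not a number, e.g. a row label
    return -value if negative else value
```

Calling `extract_rows` on a filing gives lists of cell strings, and `parse_amount` turns the numeric ones into floats. Deciding which row is "Total assets" and which column is the current period is the part that needs the rules.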

I've heard that CapitalIQ/Thomson Reuters actually use Indian financial professionals to manually extract the info. This could be a good way to backfill or double-check missing or bad values, but I chose not to try that path. In the end, many of the potential customers will opt to pay a much higher price for a better brand and/or higher-level processing like normalizing accounting standards.


