BasHamer's comments

BasHamer · on Nov 28, 2018

If this can get me tables out of pdf's generated by crystal reports it would be a godsend for testing. This has been a nightmare to try and solve, the best option so far has been adobe cloud but they don't offer an API for that. I'm excited to try it out.

mjt58 · on Nov 28, 2018

Have you tried e.g. https://tabula.technology, https://pdftables.com, https://pypi.org/project/Camelot/?

counciltime · on Dec 6, 2018

I have a friend who has also developed a number of applications that use OCR specifically for PDF which uses Tesseract. The Report Miner application does a nice job of locating and extracting PDF tables.

https://www.opait.com/tesseractstudio/

https://www.opait.com/Pdfreportminer/

minhtripham · on Dec 17, 2018

Would love to learn more about the apps your friend developed--currently doing research into different OCR use cases + tech. can you shoot me an email at minh@docucharm.com?

BasHamer · on Nov 28, 2018

https://pdftables.com failed the test file, pretty good but inconsistent interpretation across rows, sometimes it split the cell, sometimes it did not. Tabula failed to detect multi-line rows, after manually changing the table it did do better than pdftables.com on splitting cells. Both failed the non-printable whitespace characters that created garbled outputs in the excel. The other one would take some time to rig up.

ocrcustomserver · on Nov 28, 2018

You can also try https://docparser.com/.

If nothing works for you and you're comfortable with sharing an example file, you can send it to me and I could take a look.

cdolan · on Nov 29, 2018

Rather than the Camelot link you provided, I think you meant Excalibur? https://github.com/camelot-dev/excalibur

mjt58 · on Nov 29, 2018

Oh yes, thanks :-)

RandomBookmarks · on Nov 28, 2018

How about https://ocr.space/tablerecognition

It returns table data line by line.

BasHamer · on Nov 28, 2018

handled the non-printed whitespace but butchered the multi- line table headers, so re-building the headers is rough as it is line by line and you need to know what words go together and you have lost the structure.

cdolan · on Nov 29, 2018

Can you send me a copy of what you are trying to extract? We use proprietary stuff (we're in the business of extracting data and performing analysis on invoices for waste, recycling, cellular, etc... stuff that gets "lost" in the AP department.

Happy to see if our tools can help. I've tried everything on the market - DocParser, MediusFlow, KOFAX, Ephesoft, etc... none work well enough in my opinion.

BasHamer · on Nov 29, 2018

I should be able to get you some files, getting approval now; can you let me know how to contact you?

cdolan · on Nov 30, 2018

I changed my about to have a phonetic spelling of my email address, hosted on a very popular domain name. Feel free to toss me an email

BasHamer · on May 29, 2018

Data-driven algorithms are discriminating based upon undesirable/illegal vectors; they are utterly amoral in optimizing their solutions. Even if the algorithm does not have access to the "Age" field, then there are plenty of proxies, like what reunion tour you liked. And the same goes for race, gender, sexual identification, religion, etc.

To solve this we either need the training data to have no illegal/undesired discrimination, or we make the system moral. I think the first is impossible, and the second is what we will do sooner or later.

biztos · on May 29, 2018

How would you make the system moral?

Let's say "moral" means "won't discriminate based on X" and the same "system" is used by everyone, which of course it wouldn't be.

So do you make up a bunch of fake "people" who are equal in everything except X, and test that it doesn't advantage/disadvantage the X's? Would that even be possible if the "system" is getting its inputs from social media?

Do you do mandate some kind of audit of the system's decisions, and require it to choose on average the same percentage of Xs as... what? As there are Xs in the general population? In the candidate pool?

I'd love for this kind of thing to work but even in an idealized hypothetical version it's hard to see how it could.

I think in tech we've already shown that shame is no barrier to hiring discrimination, and as HR+AI type filtering systems preselect candidates for you it'll be harder and harder for you or the government or the disadvantaged candidates to even know if you're discriminating.

You'll judge the "system" based solely on whether the set of candidates you got achieved the outcome you needed.

BasHamer · on May 29, 2018

train it.

Give it examples that we consider moral and examples of what we consider immoral and have it figure it out. The solutions that the algorithms create are less complex than the data that they base the solutions on; so it should be relatively easy for it to model these solutions as data. We would have to train it on what we consider moral and immoral; that would require us to visualize the solutions in a way that a human can make the determination and provide the feedback.

As far as how we get to the solution, that will probably come when there is a liability for discrimination. So lawsuits like the one mentioned. I think that mandating does not work well, it would be more appropriate to make people liable for the decisions made by amoral systems. This liability would create a demand for moral systems.

jerf · on May 29, 2018

"Give it examples that we consider moral"

That's a tall order, honestly. There's a lot of things in the current dominant SV philosophy that are fine and dandy and everybody thinks they agree with everybody else about them as long as everyone carefully agrees to not sit down and actually put numbers on the terms in question ("discrimination is bad!" "I agree!"), but when it comes time to write down concrete rules and provide concrete examples ("hiring a woman is 43.2% preferable to hiring a man; hiring an African American is 23.1% preferable to hiring a Chinese person") are going to make people squirm, and everyone involved in such a project is going to do everything in their power to avoid having to deal with the result.

I bet there's a number of people reading this post right now squirming and deeply, deeply tempted to hit that reply button and start haranguing me about those numbers and how dare I even think such things, as you've been trained to find someone to blame for any occurrence of such words and I'm the only apparent candidate. But I have no attachment to the numbers themselves and I pre-emptively acquiesce to any corrections you'd care to make to them, for the sake of argument. I expect a real model would use more complicated functions of more parameters, I just used simple percentages because they fit into text easily. But any algorithm must produce some sort of result that looks like that, and once you get ten people at a table looking at any given concrete instantiation of this "morality", 9.8 of them are not going to agree it's moral.

I cite the handful of articles we've even seen in peer-reviewed science journals, sometimes linked here on HN, which discuss the discriminatory aspects of this or that current ML system, while scrupulously avoiding answering the question of what exactly a "non-discriminatory" system actually is. It's one of those things that once you see it you can't unsee it. (And given that these papers are nominally mathematical papers by nominally "real scientists", if I were a reviewer I'd "no publish" these papers until they fix that oversight, because it isn't actually that useful to point out that an existing mathematical system fails to conform to a currently-not-existing mathematical standard.)

biztos · on May 29, 2018

Yeah but -- assuming this could work in theory -- is it actually possible to give it examples if you don't know all its inputs?

For instance what if when scraping social media it discovers that a tendency to post memes with the color green together with frequent mention of cats correlates to better Python skills, but it happens that Elbonians are forbidden by law to mention cats? Would the system even know that's what it found? Would it even be knowable in the end? Would that be an immoral outcome even if the system didn't know about the Elbonian Anti-Cat Law? And wouldn't you have to know about the correlation already in order to give it "moral" and "immoral" examples?

I agree that litigation (the threat of litigation) is going to remain a factor for a long time, but I see this potentially turning into some kind of black-box system where there might be very serious discrimination but it would be impossible to prove.

[edit: corrected for spell-check, and apologies to any Albanians who like cats. :-)]

s73v3r_ · on May 29, 2018

"How would you make the system moral?"

Don't use it.

BasHamer · on May 22, 2018

It is worth noting that the cost of correcting inaccurate software has gone down.

Looking at the decision to adopt some defect prevention strategy in software. Cost of strategy < ∑(perceived chance of a defect being prevented)*(cost of the defect + cost of correcting the defect)

1968 vs 2018 Cost of strategy I doubt this changed much. For some strategies, this changed a lot, like Buy vs Build where the cost to buy has gone to near zero due to npm, NuGet, CPAN etc.

Cost of the defect I doubt the perception of this changed much, whether that is accurate is up for debate. Software defects are prone to long tail events that will have a disproportionate effect.

Cost of correcting the defect This went to engineers send on a plane with physical media to some customers mainframe to floppies in the mail to downloadable patch installers to asking the customer to patch from the application to pushing code and letting automated build, deploy, test and background updates. Compared to 1968 the cost went almost to zero.

Strategies have to be better or cheaper to be adopted vs 1968; because the costs of defects have plummeted for many organizations. Unfortunately, the author only references "cost" once in the 15 pages.

greydius · on May 22, 2018

This is a valid observation; however, we also need to consider that we write orders of magnitude more software in 2018 than we did in 1968. I don't think the defect / LOC (or whatever metric you want to use) has decreased. So while the cost per defect might be far lower, the total number of defects keeps increasing. The mitigation strategies that I've seen reduce costs mainly because they simply choose not to fix a lot of problems.

BasHamer · on May 22, 2018

I prefer to look at value and costs.

To me LoC is an indicator of how much you'll spend on maintenance; less is better.

I think that defects per unit of value have plummeted.

tonyedgecombe · on May 22, 2018

This is true although the number of users has increased since 1968, a defect at Google or Facebook is going to affect millions.

BasHamer · on May 22, 2018

Sort of, they actually do rolling updates so that a new version of the code does not affect the whole user base at once. So again, reducing the cost of incorrect software. But it does happen, the VW emission scandal was effectively incorrect code. Noone predicted the 22 billion dollar defect, but due to re-use of components, it is possible.

Long Tail events is a big problem in software and a few lines of code are responsible for a large part of the costs (for a longer version of this answer https://possumlabs.com/the-cost-of-software-bugs-5-powerful-...).

maxxxxx · on May 22, 2018

"But it does happen, the VW emission scandal was effectively incorrect code."

I don't think we should call what VW did "incorrect software". It actually did what it was supposed to do.

BasHamer · on May 16, 2018

So what they are saying is that every car in the employee parking lot will have a cellphone in it?

Wouldn't that cause a lot of break-ins?

BasHamer · on May 8, 2018

Pay late stage startup employees & new-hires to keep a time diary, and see where they spend time with coworkers. Then do the same w/ people not working at startups. I guess that startups have more "non-work" time spend together.

Joining a startup is different than a 9-5 commitment, it means joining a tribe/family. So if the office people leave for a 2-hour lunch before spending a late night in the office, how does a remote person "join" that? Being part of the family means you are there for the leisure as well as the work.

BasHamer · on April 19, 2018

doesn't the answer boil down to marketing?

It is talking about expressing shared values (trough open source and mentoring) value (personal accomplishment) and warm fuzzies (being a fan of the company, product, etc.)

As to how to do that; learn it or hire someone. As you likely only have your hours to sell I'd say the case for outsourcing becomes compelling.

BasHamer · on April 16, 2018

A name mentioned twice might have warranted a paragraph. Robert Moses, the subject of "The Power Broker" a Pulitzer Prize-winning biography (worth an audiable credit).

He ran the Triborough and build a lot of bridges, parks, parkways etc. Most of those bridges do not have rail decks, because he also believed the future was cars, and rail would compete with his source of revenue, toll fees.

It is hard to overstate the impact that one person had on the NY infrastructure, but this article very much understates his impact.