Saturday, 29 July 2017

PDF Indexer

In the previous post I mentioned I’d create an index for a batch of downloaded PDFs. The files are named TR-XXXX.PDF (where XXXX is a number), and the index maps each TR-XXXX to its document title. At the time of writing that post, I wasn’t sure how I’d create the index.

The problem is to map each TR-XXXX.PDF to a document title, and I can’t see any easy way to do this without opening each PDF and typing out the title. It is a manual task, but the following tool helps a little. The PowerShell script opens each PDF in turn and prompts you to type the title, which it appends to an index file. You can have the PDF and PowerShell open side by side, so it’s just a case of: type out a title, press ENTER, type out the next title, and so on. It’s a little automation that helps a little!

Image: PDF Indexer in Action!

Note: There probably is an index somewhere listing NetApp TRs and their document titles, but I’ve not found it and I’ve not asked (I found it not un-useful to be aware of all the TR titles - some of the TRs I never knew existed). If any reader is aware of an official index, please share the knowledge.

The Script

Copy into a text editor and save as, say, PDF_Indexer.ps1, then run it in PowerShell. It needs to be run from the folder that contains the PDFs, because it looks for them in the current directory.


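[Int]$RangeStart = 4000
[Int]$RangeEnd   = 5000
$IndexFile = "Index.txt"

# Create the index file only if it does not already exist
If (-not (Test-Path $IndexFile)) {
  New-Item $IndexFile -Type File | Out-Null
}

# List the PDFs once (-match is case-insensitive, so .PDF and .pdf both pass)
$Pdfs = Get-ChildItem | Where-Object { $_.Name -match '\.pdf$' }

For ($i = $RangeStart; $i -le $RangeEnd; $i++) {
  # Require TR-<number> at the start of the name, not followed by another digit,
  # so a number that merely appears somewhere in a filename is not picked up
  $Pattern = '^TR-{0}(\D|$)' -f $i
  $Pdfs | Where-Object { $_.Name -match $Pattern } | ForEach-Object {
    Start-Process $_.FullName    # open the PDF in the default viewer
    $Title = Read-Host "Title TR-$i"
    "TR-${i}: $Title" | Out-File $IndexFile -Append
  }
}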
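
Once the script above has written a few lines of the form "TR-XXXX: Title", the index can be read back into a lookup table. The following is only a minimal sketch: the function name ConvertTo-TrIndex and the sample titles are made up for illustration and are not part of the original tool. It assumes every index line starts with the TR number followed by a colon, which is what the script writes.

```powershell
# Hypothetical helper (not part of the original script): parse index lines
# of the form "TR-4000: Some Title" into a hashtable keyed by TR number.
function ConvertTo-TrIndex {
    param([string[]]$Lines)

    $Index = @{}
    foreach ($Line in $Lines) {
        # Split on the first colon only, so titles containing ":" stay intact
        if ($Line -match '^(TR-\d+):\s*(.*)$') {
            # A later line for the same TR overwrites an earlier one
            $Index[$Matches[1]] = $Matches[2].Trim()
        }
    }
    return $Index
}

# Made-up sample lines; with a real file, pass (Get-Content "Index.txt") instead
$Sample = @('TR-4000: Example Title One', 'TR-4001: Example: Title Two')
$Index  = ConvertTo-TrIndex -Lines $Sample
$Index['TR-4001']
```

Because later lines overwrite earlier ones, re-running the indexer and retyping a title for the same TR would leave the most recent entry in the table.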


3 comments:

  1. You can use the metatag "title" of a PDF to rename the file.
    I've tried it with the tool "Advanced Renamer" (https://www.advancedrenamer.com/) which worked quite well. But unfortunately many TRs have no title or just the name of the original template:
    https://abload.de/img/unbenanntwbsy0.png

    There is also a Python-script on GitHub which does the same:
    https://github.com/jdmonaco/pdf-title-rename

    Anyway, thanks for the index. I used that list to rename all the TRs with Advanced Renamer. Unfortunately, your PowerShell script now thinks the PDFs are no longer there and re-downloads them.
    It would be nice if it checked only the first seven characters of the filename (TR-XXXX) and then decided whether there is a newer version of that PDF.
    Even better would be if the newly downloaded version of a TR also got the filename of the older version, so that one would not need to rename the newer version again.

    Replies
    1. Another suggestion: adjust the download part of the script so that it retries a failed TR download up to three times. I had two TRs in the 4000 to 5000 range which I could only download on the second run.

    2. Thanks Oli.
      I was originally going to check only on the TR-XXXX part of the filename, then got lazy. I'll revisit the script later (also to re-attempt download). Since there's now an index, I could make the PowerShell read the index post and rename that way.
      I'll put a reminder in my calendar to spend a few moments updating the index every month. Really, I should just ask NetApp to publish a page with links to all their TRs with latest title (not totally sure why this doesn't exist, or - if it does - where it is.)

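
The suggestion in the comments of checking only the first seven characters of the filename could be sketched roughly as follows. This is an illustration only: Get-TrKey is a hypothetical name, the filenames are invented, and it assumes four-digit TR numbers as in TR-XXXX.

```powershell
# Hypothetical helper: derive the TR key from the first seven characters of a
# file name, so "TR-4015.pdf" and a renamed "TR-4015 Some Title.pdf" both map
# to the same key.
function Get-TrKey {
    param([string]$FileName)

    # -match is case-insensitive, so "tr-4015.pdf" is accepted too
    if ($FileName -match '^(TR-\d{4})') {
        return $Matches[1].ToUpper()
    }
    return $null   # not a TR file name
}

Get-TrKey 'tr-4015.pdf'
Get-TrKey 'TR-4015 Renamed Title.pdf'
Get-TrKey 'notes.pdf'
```

A download script could compare these keys instead of full filenames when deciding whether a TR is already present, which would let renamed files count as existing.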