Importing from Treepad – many blank characters

Viewing 6 posts - 1 through 6 (of 6 total)
  • Author
    Posts
  • #125368
    rah
    Member

    I’ve been experimenting with importing Treepad data into CT. It works very nicely except for one big problem – all the info in the right-hand panel is filled with MANY blank trailing lines.

    In TP, when you enter info into the right-hand panel, there are no blank lines below where you are typing, so the data has no trailing spaces (unless you do that yourself). But when I import this data into CT, it is filled with many lines of spaces.

    I have included 2 screen snapshots, one showing the original in TP, the other showing the same data after import into CT. In the right panel of each, I have positioned my cursor and then pressed Ctrl-A, to select all the text in that panel.

    As you can see, the TP entry is only selecting the lines with the actual words “When dealing with quotations that extend over more than one paragraph, you need to put quotation marks at the beginning of each paragraph but at the end only of the final one.” This is just to demonstrate that there are no trailing blank characters/lines.

    But the CT entry has not just the words, but then an entire screen full of blank characters.

    This is a problem because if I later try to export the CT data into, say, a PDF or a text file, I get MANY blank screens filled with all this empty text. This matters to me because I want to view the same data on an Android tablet but cannot get a good-looking file because of all this blank data.

    Can CT be fixed to stop doing this, perhaps? Or is there some workaround that someone knows? Thanks!
    treepad snapshot
    cherrytree snapshot

    #125369
    eureka
    Member

    I don’t know but using SearchMonkey as a search tool I searched cherrytree looking for Treepad …

    Hits found here ..

    cherrytree-0.38.2/modules/imports.py

    Line Number: 1992
    """Returns a CherryTree string Containing the Treepad Nodes"""
    Line Number: 2003
    class TreepadHandler:
    Line Number: 2004
    """The Handler of the Treepad File Parsing"""
    Line Number: 2060
    """Returns a CherryTree string Containing the Treepad Nodes"""

    and in cherrytree-0.38.2/modules/core.py

    Line Number: 1051
    """Add Nodes Parsing a Treepad File"""
    Line Number: 1053
    filter_name=_("Treepad Document"),
    Line Number: 1066
    treepad = imports.TreepadHandler()

    My idea is that perhaps the parser might be hacked to purge trailing blank lines in imported Treepad documents.

    #125371
    rah
    Member

    I am now thinking that Treepad is at fault, not CT. I say this because even though when TP displays info in the right-hand panel it ends it at the last actual character, in fact the FILE that is being used has MANY space characters.

    What makes me think this is that I experimented with exporting an entire file (with many nodes) from TP into multiple text files (in folders; CT can nicely import a set of files like this). When I then look at those plain text files exported by TP, they have many space characters after the final actual text. And indeed, if you look at a TP database file with a programming editor, the file itself also has a ton of trailing blank characters.

    So, I can understand why CT is just taking what it is given: garbage in, garbage out. I guess one way around this is to write, say, a Perl program to strip out all those offending characters before giving the file to CT (an editor macro, say an UltraEdit macro, could also be used). Oh well…
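A minimal sketch of the pre-processing idea above, written in Python rather than Perl. It assumes the Treepad export is a folder tree of plain `.txt` files; the function names `strip_trailing_blanks` and `clean_tree` are made up for illustration, and the default `utf-8` encoding is a guess (an older Windows export may need something like `cp1252`). It removes only the whitespace after the last real character of each file and leaves blank lines inside the text alone.

```python
from pathlib import Path


def strip_trailing_blanks(text: str) -> str:
    """Drop all whitespace after the last real character, keeping one final newline."""
    # str.rstrip() with no arguments removes spaces, tabs, CR/LF and non-breaking spaces.
    stripped = text.rstrip()
    return stripped + "\n" if stripped else ""


def clean_tree(root: str, pattern: str = "*.txt", encoding: str = "utf-8") -> int:
    """Rewrite every matching file under root; return how many files were changed."""
    changed = 0
    for path in Path(root).rglob(pattern):
        original = path.read_text(encoding=encoding)
        cleaned = strip_trailing_blanks(original)
        if cleaned != original:
            path.write_text(cleaned, encoding=encoding)
            changed += 1
    return changed
```

Run it on a copy of the export first, for example `clean_tree("treepad_export")` (a hypothetical folder name), and then import the cleaned folder into CT.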

    #125441
    rah
    Member

    I found a way around this which I will mention here in case someone else wants to do this (Treepad is definitely discontinued now).

    In Treepad, you export your Treepad .hjt file using:

    File>Export>Subtree>To File(s)

    In the dialog, pick “Export to multiple files” and then click Next
    In the next dialog, pick HTML files and then click Next

    In the next dialog, pick Recurse Subtree (I guess; it works)

    The result is a directory structure containing all your TP nodes in separate subdirectories as HTML files. As I already mentioned, CT can successfully import a directory of files like this (I tried text files earlier, and HTML files work too).

    What makes HTML files better than text is that all those awful blank lines in the TP exports are not just blank lines, but are HTML blank lines that look like this:

    +nbsp;(br)^p+nbsp;(br)^p

    (I have had to change all the ampersandnbsp;s to start with a plus, and all the brackets to use parens here so you can see them on the page)

    This is easier to handle (at least for me) for stripping this crap out. Using an editor like UltraEdit, a global change like this does the trick:

    change +nbsp;(br)^p+nbsp;(br)^p
    to
    nothing

    [where ^p is UltraEdit’s expression for CRLF; Unix files end lines with just LF (\n), and $ is the regex end-of-line anchor rather than a line ending; in an editor that accepts hex, CRLF is 0d0a]

    This effectively replaces any occurrences of 2 lines of spaces with nothing, leaving only one-line occurrences, which might be found throughout the file (and you want to keep). It DOES remove ALL the blank lines at the end of a file, presumably because the global change keeps finding the pairs, wiping them out and continuing on, wiping them again, etc.

    UltraEdit has a Replace In Files option that can do such a replace on an entire directory structure of files, so for one Treepad hjt export, it did this replace on 32866 finds in 328 files in one shot.
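For anyone without UltraEdit, the same pairwise replace can be sketched in Python. This is an illustration under assumptions, not a tested recipe for Treepad exports: it assumes each blank line is literally `&nbsp;<br>` followed by CRLF (the pattern also tolerates a bare LF, `<BR>`, and `<br />`), and that the exported files match `*.htm*`. The names `remove_blank_pairs` and `clean_html_tree` are hypothetical. Working on raw bytes leaves the line endings and the file encoding untouched.

```python
import re
from pathlib import Path

# Two consecutive "blank" HTML lines: &nbsp;<br> followed by CRLF (or a bare LF).
BLANK_PAIR = re.compile(rb"(?:&nbsp;<br\s*/?>\r?\n){2}", re.IGNORECASE)


def remove_blank_pairs(data: bytes) -> tuple[bytes, int]:
    """Delete every non-overlapping pair of blank lines; return (new data, pairs removed)."""
    return BLANK_PAIR.subn(b"", data)


def clean_html_tree(root: str, pattern: str = "*.htm*") -> tuple[int, int]:
    """Apply remove_blank_pairs to every matching file under root.

    Returns (total pairs removed, number of files changed).
    """
    total = files = 0
    for path in Path(root).rglob(pattern):
        data = path.read_bytes()  # bytes keep CRLF and the original encoding as they are
        cleaned, count = remove_blank_pairs(data)
        if count:
            path.write_bytes(cleaned)
            total += count
            files += 1
    return total, files
```

As with the editor replace, a run with an odd number of blank lines is left with one blank line, and isolated single blank lines inside a node are kept.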

    What you are left with is a clean set of files that can be imported into CT without all those awful blank lines. Very nice!

    So, I am going to do this for all my TP files and hopefully start using CT for real very soon. 🙂

    #125444
    eureka
    Member

    One of the hidden benefits I have found in playing with CherryTree is that each file can be processed as an XML file.
    If you are using the SQLite storage format, switch to the XML (*.ctd) format and inspect the myscript.ctd file. It is clean XML. In fact it can be edited using any XML editor (I use XMLCopyEditor). XML elements can be parsed to remove trailing lines, or you can even apply a regex to the *.ctd file.

    For frequent changes I sometimes use a Python script to parse the XML format.
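A rough sketch of what such a script might look like for the trailing-lines problem. It rests on an assumption about the `.ctd` layout that should be checked against a real file: a `<cherrytree>` root containing nested `<node>` elements, whose text sits in direct `<rich_text>` children. The function names `trim_node` and `trim_ctd` are invented here. It does not try to handle embedded objects (images, tables, code boxes) that might be anchored inside the trailing whitespace, so work on a copy.

```python
import xml.etree.ElementTree as ET


def trim_node(node: ET.Element) -> bool:
    """Strip trailing whitespace from the text of one <node>; return True if it changed."""
    changed = False
    # Direct <rich_text> children only, walked from the last one backwards.
    for run in reversed(node.findall("rich_text")):
        text = run.text or ""
        stripped = text.rstrip()
        if stripped:
            if stripped != text:
                run.text = stripped
                changed = True
            break  # reached real content; everything before it is left alone
        node.remove(run)  # this run was empty or whitespace only
        changed = True
    return changed


def trim_ctd(src: str, dst: str) -> int:
    """Read src, trim every node, write the result to dst; return the number of nodes changed."""
    tree = ET.parse(src)
    nodes = list(tree.getroot().iter("node"))  # snapshot before modifying the tree
    count = sum(trim_node(node) for node in nodes)
    tree.write(dst, encoding="utf-8", xml_declaration=True)
    return count
```

Writing to a separate `dst` file, for example `trim_ctd("myscript.ctd", "myscript_trimmed.ctd")`, keeps the original intact until the result has been opened and checked in CT.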

    Yet another variant is to regard each CherryTree file as an XML document which can be uploaded to a NoSQL database for XML. I have tried this approach with eXist-db and this allows multiple CherryTree documents to be held in a collection for querying. Just ensure that you change the extension from *.ctd to *.xml.

    So you can use multiple methods to trim trailing lines.

    Incidentally this approach works for other needs such as export/import tree branches as discussed in another recent thread.

    #125448
    rah
    Member

    Thanks for the tips on XML, eureka. I actually decided to use XML format on my CT files already, just because I liked the sound of the way it handled them (loading everything initially, etc), plus I like to have files stored as pure text whenever possible just for potential flexibility (which is what your tips illustrate). I will take a look at XMLCopyEditor.

    Concerning trailing blank lines in TP files, I am not entirely sure that this was caused by TP. Just about all my TP files started life many years ago in an earlier tree-arrangement program called Vault (for Windows). It was very nice but was discontinued. So I converted everything to TP. I may have picked up those blank lines at that point. Of course, I could experiment with TP to figure all this out, but since TP is gone, it doesn’t much matter. But perhaps others won’t have this trailing blank line problem.
