aspiers · 01-19-2016, 09:42 PM

Hi all, just catching up. It's very cool that MSP now supports CSV-based import of multiple songs from a single PDF. It would be especially cool if we could align the format with the files in my project https://github.com/aspiers/book-indices, so that we can all collaborate on building CSV index files which can be used seamlessly both inside MSP and outside it (e.g. with my https://github.com/aspiers/PDFexploder project which explodes large PDFs into one correctly named file per song).

I'd appreciate suggestions on how to move forward with that.

sciurius · 01-19-2016, 10:57 PM (This post was last modified: 01-20-2016, 01:53 AM by sciurius.)

I think that a collective approach towards fakebook indices is very good, and I appreciate you taking initiatives!

However, looking at your CSVs I think they're too limited. They contain just a starting and ending page, not page ranges. If two pages were swapped in your copy of the book --and this happens-- this cannot be dealt with.
Also, there's no provision for information like key, composer, artist. Columns are fixed.

Another potential trap of these indices is "The final page number is optional, because it can often be automatically inferred by the starting page of the next tune, ...".

In NewReal1.csv I see:

Code:
Airegin,17,
# ...
All Or Nothing At All,444,
Always There,18,

This would mean that song "Airegin" runs from page 17 thru 443, and "All Or Nothing At All" will probably crash the program :)

It would be great to employ a more generic data standard for these indices. With a headings row, it would already be more flexible. For example, an index entry could use either "startpage" and "endpage", or a more powerful "pagerange". And it allows for additional, optional information without affecting tools that do not understand it.
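To make the headings-row idea concrete, here is a minimal Python sketch. It is only an illustration, not code from MSP or PDFexploder: the column names "title", "startpage", "endpage" and "pagerange" follow the suggestion above, while the "composer" column, the sample rows and the function name read_index are assumptions made up for the example.

```python
import csv
import io

# Illustrative index with a headings row; the page values are made up.
SAMPLE = """\
title,pagerange,startpage,endpage,composer
Airegin,,17,,Sonny Rollins
Always There,18-19,,,
"""


def read_index(stream):
    """Read an index with a headings row.

    Accepts either a "pagerange" column or "startpage"/"endpage",
    and silently ignores any columns it does not know about.
    """
    entries = []
    for row in csv.DictReader(stream):
        if row.get("pagerange"):
            pages = row["pagerange"]
        else:
            start = row["startpage"]
            end = row.get("endpage") or start
            pages = start if start == end else f"{start}-{end}"
        entries.append({"title": row["title"], "pages": pages})
    return entries


for entry in read_index(io.StringIO(SAMPLE)):
    print(entry)
```

Because fields are looked up by name, a tool that only knows "title" and the page columns keeps working when someone adds "key" or "artist" later.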

A final remark: the github page reads "... it's the very well-known CSV (or Comma-Separated Values) format". I'm sorry to disappoint you, but although well-known, there is no such thing as a standard for CSV formatted data files. The de facto standard defined in RFC 4180 is a good starting point.

aspiers · 01-20-2016, 02:33 AM

(01-19-2016, 10:57 PM)sciurius Wrote: I think that a collective approach towards fakebook indices is very good, and I appreciate you taking initiatives!

Thanks! And I very much appreciate your useful feedback!

(01-19-2016, 10:57 PM)sciurius Wrote: However, looking at your CSVs I think they're too limited.

Absolutely - that's why I asked for feedback in the first place :-) The existing repo is just a prototype.

(01-19-2016, 10:57 PM)sciurius Wrote: They contain just a starting and ending page, not page ranges.
If two pages were swapped in your copy of the book --and this happens-- this cannot be dealt with.

Very good point, but this is extremely easy to fix! For example we could collapse the page selection into a single field which supports different types of values:

"5" would mean just page 5
"5-" would mean page 5 and all subsequent pages until the next page not claimed by any other song
"5-8,10,9,11-" would mean the same as above but with pages 9 and 10 swapped round and assuming that the song doesn't finish before page 11
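A rough Python sketch of how such a field could be parsed. This is not taken from PDFexploder or MSP; the function names are hypothetical, and it assumes that an open-ended range like "5-" may only appear as the last element and runs up to the page just before the next page claimed by another song.

```python
def parse_page_spec(spec):
    """Parse a page selection such as "5", "5-" or "5-8,10,9,11-".

    Returns (pages, open_start): pages is the explicit page list in the
    order given, open_start is the first page of a trailing open-ended
    range, or None if there is none.
    """
    pages = []
    open_start = None
    parts = [p.strip() for p in spec.split(",")]
    for i, part in enumerate(parts):
        if part.endswith("-"):
            # Assumption: an open-ended range only makes sense at the end.
            if i != len(parts) - 1:
                raise ValueError(f"open-ended range must come last: {spec!r}")
            open_start = int(part[:-1])
        elif "-" in part:
            first, last = (int(n) for n in part.split("-"))
            if last < first:
                raise ValueError(f"descending range in {spec!r}")
            pages.extend(range(first, last + 1))
        else:
            pages.append(int(part))
    return pages, open_start


def resolve_open_range(pages, open_start, claimed, last_page):
    """Extend an open-ended range up to just before the next claimed page.

    claimed is the set of pages used by other songs; last_page is the
    final page of the book, used when nothing later is claimed.
    """
    if open_start is None:
        return pages
    later = [p for p in claimed if p > open_start]
    end = min(later) - 1 if later else last_page
    return pages + list(range(open_start, end + 1))


pages, open_start = parse_page_spec("5-8,10,9,11-")
print(resolve_open_range(pages, open_start, claimed={14, 15}, last_page=400))
```

The swapped pages survive because the explicit list keeps the order in which it was written.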

(01-19-2016, 10:57 PM)sciurius Wrote: Also, there's no provision for information like key, composer, artist. Columns are fixed.

Yes - also easily corrected, and I agree that it obviously needs to be.

(01-19-2016, 10:57 PM)sciurius Wrote: Another potential trap of these indices is "The final page number is optional, because it can often be automatically inferred by the starting page of the next tune, ...".

In NewReal1.csv I see:

Code:
Airegin,17,
# ...
All Or Nothing At All,444,
Always There,18,

This would mean that song "Airegin" runs from page 17 thru 443, and "All Or Nothing At All" will probably crash the program :)

Oh dear, you seem to have a very dim view of my programming skills ;-) However you can set your fears to rest; my code easily handles this, e.g. https://github.com/aspiers/PDFexploder/b...ion.rb#L25 - and any other implementation would find it easy to do the same.

(01-19-2016, 10:57 PM)sciurius Wrote: It would be great to employ a more generic data standard for these indices. With a headings row, it would already be more flexible. For example, an index entry could use either "startpage" and "endpage", or a more powerful "pagerange". And it allows for additional, optional information without affecting tools that do not understand it.

Absolutely - great idea.

(01-19-2016, 10:57 PM)sciurius Wrote: A final remark: the github page reads "... it's the very well-known CSV (or Comma-Separated Values) format". I'm sorry to disappoint you, but although well-known, there is no such thing as a standard for CSV formatted data files. The de facto standard defined in RFC 4180 is a good starting point.

You are not disappointing me, because I entirely disagree ;-) That is a separate discussion which probably does not belong on this forum, but the starting point would be to consider what is a reasonable definition of "standard", then to consider some other popular standards in the computing world (e.g. the myriad of standards relating to email), and finally compare how strictly those are ratified and adhered to by implementations, relative to CSV. We live in a messy world, where even supposedly clearly defined standards often contain crucial ambiguities, flaws, and rival definitions. However that does not preclude them from being standards; it just means that they could be improved. But I'm not sure it's worth having that discussion on this forum.

Standard or not, it does not diminish the point I made elsewhere, which is that the potential conflict in CSV files between characters used both as delimiters and within data fields is trivially solved by a solution which has been successfully used in CSV implementations all over the world for multiple decades. And that solution is simply to quote either just the data fields which contain delimiter characters, or quote all data fields. (My preference is the former, since the latter leads to CSV files which are less readable by humans.) So this is a solved problem and I would strongly recommend that this community reuse that solution rather than aim for an alternative which relies on the delimiter character never being needed within any data field, since that is pretty much doomed to fail in some corner cases.
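For anyone who wants to see this in action, here is a small sketch using Python's standard csv module. The song title with a comma and the column names are just illustrative assumptions; the point is that csv.QUOTE_MINIMAL quotes only the fields that need it, and reading the text back recovers the original values.

```python
import csv
import io

# Illustrative rows: one title and one page field contain commas.
rows = [
    ["title", "pages"],
    ["Airegin", "17"],
    ["Hello, Dolly!", "5-8,10,9,11-"],
]

# QUOTE_MINIMAL quotes only fields containing the delimiter,
# the quote character or a line break.
buf = io.StringIO()
writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
writer.writerows(rows)
text = buf.getvalue()
print(text)

# Reading the text back yields exactly the original fields.
parsed = list(csv.reader(io.StringIO(text)))
print(parsed == rows)
```

Passing csv.QUOTE_ALL instead would quote every field, which is the less readable variant mentioned above.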

Anyway, thanks again for the great feedback! Cheers, Adam

sciurius · 01-20-2016, 10:52 PM

(01-20-2016, 02:33 AM)aspiers Wrote: Oh dear, you seem to have a very dim view of my programming skills ;-) However you can set your fears to rest; my code easily handles this, e.g. https://github.com/aspiers/PDFexploder/b...ion.rb#L25 - and any other implementation would find it easy to do the same.

I trust your code doesn't crash. Neither does mine ;)

But my objection is that using only a starting page may not be sufficiently deterministic under all circumstances.
Given the example in my posting, you would need to collect all song data and sort on page number to be sure.

Also:

Quote:"5-" would mean page 5 and all subsequent pages until the next page not claimed by any other song

The usual interpretation (e.g., LaTeX): If you specify a range consisting of a hyphen (or any tie) but with one or two empty page numbers, the following will happen:

1. a range of the form -34 is taken to mean pages 1 to 34;
2. a range of the form 12- is taken to mean page 12 to last page;
3. a range of the form - (only hyphen) is taken to mean page 1 to last page.
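The three rules above fit in a few lines of Python. This is just a sketch of the interpretation described here, with a hypothetical function name and a last_page parameter that the caller must supply; it is not code from LaTeX or any of the tools discussed.

```python
def latex_style_range(spec, last_page):
    """Interpret "-34", "12-", "-" or "7" as an inclusive (first, last) pair.

    A missing first page means page 1; a missing last page means the
    last page of the document.
    """
    first, sep, last = spec.strip().partition("-")
    if not sep:
        # No hyphen at all: a single page.
        page = int(first)
        return (page, page)
    return (int(first) if first else 1, int(last) if last else last_page)


for spec in ("-34", "12-", "-", "7"):
    print(spec, latex_style_range(spec, last_page=100))
```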

sciurius · 01-21-2016, 02:26 AM

I'm going to carry the use of CSV metadata a step further.

First, I'm currently finishing a tool that reads iRealPro data and formats this into a nice PDF. iRealPro songs contain a limited amount of metadata like title, composer, style, key, and tempo.
If the iRealPro data contains an iRealPro playlist, I produce a multi-page PDF document and the corresponding metadata CSV. In other words: a Fakebook plus index in one go.

For reasons not relevant here, my tool can also produce PNGs instead of a PDF. This brings me to the feature request to extend the use of metadata CSV for other imports (in particular, batch import) as well.

For example, I have a folder with ChordPros or PNGs, each containing one song. I can batch import this folder, but it would be very nice if I could place a metadata CSV in the folder (or specify on the import dialog) so that all imported songs can have some metadata filled in.
I even think that given how far Mike already implemented support for metadata CSVs this won't be hard to add.
Implementation hint: Add a "filename" or "pathname" element to the CSV to match a file with its metadata.
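To illustrate the hint, a minimal Python sketch of how a "filename" column could tie files in a folder to their metadata. This is not how MSP works today; the function name, the other column names and the sample values are assumptions for the example, and matching is done on the bare file name only.

```python
import csv
import io
from pathlib import PurePath


def match_metadata(filenames, csv_stream):
    """Map each file to the metadata row whose "filename" matches it.

    Files without a matching row get an empty dict, so they can still
    be imported without metadata.
    """
    by_name = {}
    for row in csv.DictReader(csv_stream):
        name = row.pop("filename")
        by_name[PurePath(name).name] = row
    return {f: by_name.get(PurePath(f).name, {}) for f in filenames}


# Illustrative metadata CSV and folder listing.
METADATA = "filename,title,key\nairegin.png,Airegin,Fm\n"
FILES = ["songs/airegin.png", "songs/unknown.cho"]

for path, meta in match_metadata(FILES, io.StringIO(METADATA)).items():
    print(path, meta)
```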

aspiers · 01-21-2016, 03:43 AM

(01-20-2016, 10:52 PM)sciurius Wrote: I trust your code doesn't crash. Neither does mine ;)

Good to know ;-)

(01-20-2016, 10:52 PM)sciurius Wrote: But my objection is that using only a starting page may not be sufficiently deterministic under all circumstances.
Given the example in my posting, you would need to collect all song data and sort on page number to be sure.

Yes; that's what my code does. Do you envisage any problems with that?

(01-20-2016, 10:52 PM)sciurius Wrote: Also:

Quote:"5-" would mean page 5 and all subsequent pages until the next page not claimed by any other song

The usual interpretation (e.g., LaTeX): If you specify a range consisting of a hyphen (or any tie) but with one or two empty page numbers, the following will happen:

1. a range of the form -34 is taken to mean pages 1 to 34;
2. a range of the form 12- is taken to mean page 12 to last page;
3. a range of the form - (only hyphen) is taken to mean page 1 to last page.

Yes, but I would argue that the usual interpretation is less useful in this scenario than the meaning I am proposing. Otherwise you manually have to go through every single song and provide its last page.

Another approach might be to provide a simple Ruby / Python script which automatically calculates the last pages via the same algorithm and then outputs an updated version of the CSV file with those numbers included. Then the updated version would be fed into my PDFexploder / MSP / whatever else. But this two-phase approach is less convenient for the end user and I don't see any real advantage to it. Am I missing something?
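For what it's worth, the first phase of that two-phase approach could look roughly like the Python sketch below. It is an illustration rather than the actual PDFexploder logic: the function names and the book_last_page parameter are hypothetical, and it assumes every song ends on the page before the next higher starting page, so songs sharing a page would need an explicit last page.

```python
import csv
import io


def read_rows(stream):
    """Read "title,first,last" rows; skip blank lines and "#" comments."""
    rows = []
    for record in csv.reader(stream):
        if not record or record[0].startswith("#"):
            continue
        title, first, last = (record + ["", ""])[:3]
        rows.append((title, int(first), int(last) if last else None))
    return rows


def fill_last_pages(rows, book_last_page):
    """Fill in missing last pages, keeping the original row order.

    Start pages are sorted first, so rows may appear in any order
    (e.g. alphabetically by title).
    """
    starts = sorted({first for _, first, _ in rows})
    next_start = dict(zip(starts, starts[1:]))
    filled = []
    for title, first, last in rows:
        if last is None:
            last = next_start.get(first, book_last_page + 1) - 1
        filled.append((title, first, last))
    return filled


SAMPLE = "Airegin,17,\n# ...\nAll Or Nothing At All,444,\nAlways There,18,\n"
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerows(
    fill_last_pages(read_rows(io.StringIO(SAMPLE)), book_last_page=450)
)
print(out.getvalue())
```

With only these three rows, "Always There" appears to run up to page 443 simply because the entries hidden behind "# ..." are missing from the sample.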

(01-21-2016, 02:26 AM)sciurius Wrote: I'm going to carry the use of CSV metadata a step further.

First, I'm currently finishing a tool that reads iRealPro data and formats this into a nice PDF.

You mean a PDF containing a table of contents of all the songs? My PDFexploder does that, using LaTeX actually:

https://github.com/aspiers/PDFexploder/b....latex.erb

(01-21-2016, 02:26 AM)sciurius Wrote: iRealPro songs contain a limited amount of metadata like title, composer, style, key, and tempo.
If the iRealPro data contains an iRealPro playlist, I produce a multi-page PDF document and the corresponding metadata CSV. In other words: a Fakebook plus index in one go.

My code originally combined multiple fakebooks into a single giant PDF which started with a huge ToC, and then had all songs from all fakebooks sorted alphabetically (rather than just concatenating all fakebooks together). This was quite cool, but I realised that a huge PDF is unwieldy, and it's much nicer to have one song per PDF, because this works more smoothly regardless of what music reader you choose to use. Even in MSP it makes it easier to build set lists by cherry-picking songs.

(01-21-2016, 02:26 AM)sciurius Wrote: For reasons not relevant here, my tool can also produce PNGs instead of a PDF. This brings me to the feature request to extend the use of metadata CSV for other imports (in particular, batch import) as well.

For example, I have a folder with ChordPros or PNGs, each containing one song. I can batch import this folder, but it would be very nice if I could place a metadata CSV in the folder (or specify on the import dialog) so that all imported songs can have some metadata filled in.
I even think that given how far Mike already implemented support for metadata CSVs this won't be hard to add.
Implementation hint: Add a "filename" or "pathname" element to the CSV to match a file with its metadata.

Yes, that would be awesome, and would also make it easy to bulk import PDFs generated by my PDFexploder.

sciurius · 01-21-2016, 04:44 AM

(01-21-2016, 03:43 AM)aspiers Wrote: Yes; that's what my code does. Do you envisage any problems with that?

Mostly that it is a totally unnecessary complication.

Quote:Yes, but I would argue that the usual interpretation is less useful in this scenario than the meaning I am proposing. Otherwise you manually have to go through every single song and provide its last page.
...
Another approach might be to provide a simple Ruby / Python script which automatically calculates the last pages via the same algorithm and then outputs an updated version of the CSV file with those numbers included. ... But this two-phase approach is less convenient for the end user and I don't see any real advantage to it. Am I missing something?

A rule of thumb is that when you generate data once, and process it often, it is better to put the overhead in the generating phase.

sciurius · 01-21-2016, 04:49 AM

Quote:
(01-21-2016, 02:26 AM)sciurius Wrote: First, I'm currently finishing a tool that reads iRealPro data and formats this into a nice PDF.

You mean a PDF containing a table of contents of all the songs? My PDFexploder does that, using LaTeX actually:

I get the feeling that you do not know what iRealPro is.

Quote:This was quite cool, but I realised that a huge PDF is unwieldy, and it's much nicer to have one song per PDF, because this works more smoothly regardless of what music reader you choose to use. Even in MSP it makes it easier to build set lists by cherry-picking songs.

MSPro works with songs, and it does not matter whether a song corresponds to a single-song PDF or a page selection from a huge PDF.

aspiers · 01-21-2016, 05:08 AM

(01-21-2016, 04:44 AM)sciurius Wrote:
(01-21-2016, 03:43 AM)aspiers Wrote: Yes; that's what my code does. Do you envisage any problems with that?

Mostly that it is a totally unnecessary complication.

I would suggest it's necessary in order to provide a more convenient experience to the user (i.e. the person building the CSV indices), because it optimizes the most common case which is that rows in the index correspond to the order in the fakebook, and that songs appear on contiguous pages within the fakebook.

(01-21-2016, 04:44 AM)sciurius Wrote:
Quote:Yes, but I would argue that the usual interpretation is less useful in this scenario than the meaning I am proposing. Otherwise you manually have to go through every single song and provide its last page.
...
Another approach might be to provide a simple Ruby / Python script which automatically calculates the last pages via the same algorithm and then outputs an updated version of the CSV file with those numbers included. ... But this two-phase approach is less convenient for the end user and I don't see any real advantage to it. Am I missing something?

A rule of thumb is that when you generate data once, and process it often, it is better to put the overhead in the generating phase.

For performance optimizations on huge data sets, I entirely agree. However such optimizations are entirely unnecessary here, and they come at the cost of a smoother UX.

aspiers · 01-21-2016, 05:24 AM

(01-21-2016, 04:49 AM)sciurius Wrote:
Quote:
(01-21-2016, 02:26 AM)sciurius Wrote: First, I'm currently finishing a tool that reads iRealPro data and formats this into a nice PDF.

You mean a PDF containing a table of contents of all the songs? My PDFexploder does that, using LaTeX actually:

I get the feeling that you do not know what iRealPro is.

I'm not sure why you get that feeling; I've been using iRealPro heavily since long before it got renamed from iRealB. Perhaps it was because my last statement was a bit misleading - the PDFexploder doesn't take iRealPro data as input, but it does generate a ToC PDF, which is what I got the impression (maybe incorrectly) that your tool also does, once it's parsed the iRealPro data.

(01-21-2016, 04:49 AM)sciurius Wrote:
Quote:This was quite cool, but I realised that a huge PDF is unwieldy, and it's much nicer to have one song per PDF, because this works more smoothly regardless of what music reader you choose to use. Even in MSP it makes it easier to build set lists by cherry-picking songs.

MSPro works with songs, and it does not matter whether a song corresponds to a single-song PDF or a page selection from a huge PDF.

"It does not matter" is true in the sense that MSPro supports page selections from a huge PDF. But that kind of misses my point about the UX. It's easier to import when there is no page selection to configure. There are other advantages to exploding a big PDF prior to import too, e.g. the potential to save space on your device by only importing the songs you really need, and it makes it easier to open songs using other PDF readers too (which is particularly important to me right now due to http://zubersoft.com/mobilesheets/forum/...p?tid=3224 ...) A third one is that an autogenerated ToC PDF can have hyperlinks to the PDF for each song, and this will work reliably without requiring that your PDF reader supports deeplinking from one PDF to a fixed page inside another PDF.

Having said that, the pros and cons of each approach here are marginal, so I really don't think it's worth debating them in too much depth (and I don't have the time to continue with it anyway). Both approaches are perfectly valid, and both will suit users with differing use cases, so both are worth having as options. Enough said :-)

sciurius · 01-21-2016, 07:38 AM

Quote:Both approaches are perfectly valid, and both will suit users with differing use cases

Indeed.

itsme · 01-21-2016, 08:20 AM

I have a number of index files in Excel XLS format that I would like to share as soon as the format is clear and I find the time to make the necessary adaptations.
Back in the good old pen/paper/binder time I came across the first fakebooks as PDFs. I kept (and still keep) them on the PC and printed single songs for my ring binders. It soon was clear that finding songs is essential. So I started building a database. Some TOCs I could find on the internet; some are scanned and OCR'ed. Proofreading, completing and correcting was and still is a time-consuming task, a never-ending story.
I keep an XLS file per fakebook for later reference and imported them into an Access database. With MSP the PDFs became much more usable. So I was motivated to invest time in completing the indexes.
I was involved in lengthy discussions in this forum about keeping big fakebook PDFs or splitting them. To keep my library clear I copy useful fakebooks to the tablet (one big PDF per book) but import only those songs that I really want to play. For every book I import the fakebook's TOC pages as one song, the whole book as a second "song". And I export from my database the indexes of only those fakebooks that are on the tablet as one big XLS song index. This way I have some hundreds of songs in a well-maintained library and several thousands at hand that I can use within minutes. I can jump to the respective page in the fakebook via MSP's "go to page" or add a song to the library by copying an existing song in MSP and copy/pasting the metadata from the XLS. That works fine so far regarding fakebooks and I will keep that workflow. Songs that are permanently in my repertoire are individual files, one per song, fine-tuned and exported as PDF from Finale, MuseScore, WinWord... or ChordPro. But that's another topic.

itsme · 01-21-2016, 08:47 AM

Back to technical details:
What I keep in my index XLSs and the database (see the Firehouse Jazzband example):
Index - the index as listed in the book's TOC
In the case of the above example it's a song number; in most cases it's a page number. I also came across a book with genre chapters that skips a number of pages after every chapter, so that a later edition can sort more songs into the correct chapter without changing all the page numbers.
PDFPage - the starting page in the PDF as required for MSP page order or PDFexploder
PDFLastPage - the last page of the song in the PDF, filled only for songs with more than one page
IMHO it makes sense to keep it per song in the XLS so that it is in place when the XLS is sorted alphabetically by title.
PageOrder - what MSP needs to access the songs correctly
Mostly calculated by Excel macros, with individual corrections for e.g. mixed-up pages.
AlternativTitel - allows a second title entry
Useful for e.g. Autumn Leaves = Les Feuilles Mortes, Manha de Carnaval = Black Orpheus = Orfeo Negro and so on.
Titel and Key do not need to be explained.
I usually do not maintain composer, year, genre and more. For me that's not important enough to be worth the effort. Anybody's welcome to add more...

Zubersoft · 01-21-2016, 04:59 PM

(01-21-2016, 02:26 AM)sciurius Wrote: I'm going to carry the use of CSV metadata a step further.

First, I'm currently finishing a tool that reads iRealPro data and formats this into a nice PDF. iRealPro songs contain a limited amount of metadata like title, composer, style, key, and tempo.
If the iRealPro data contains an iRealPro playlist, I produce a multi-page PDF document and the corresponding metadata CSV. In other words: a Fakebook plus index in one go.

For reasons not relevant here, my tool can also produce PNGs instead of a PDF. This brings me to the feature request to extend the use of metadata CSV for other imports (in particular, batch import) as well.

For example, I have a folder with ChordPros or PNGs, each containing one song. I can batch import this folder, but it would be very nice if I could place a metadata CSV in the folder (or specify on the import dialog) so that all imported songs can have some metadata filled in.
I even think that given how far Mike already implemented support for metadata CSVs this won't be hard to add.
Implementation hint: Add a "filename" or "pathname" element to the CSV to match a file with its metadata.

That's a very interesting idea that I'll have to look into adding at some point. I currently look for a PDF that matches the CSV name, and if that isn't found, I expect the first line of the CSV to contain a filename. I'm sure this doesn't really match the CSV spec, but I added it just in case it would be useful. It seems odd to me to have a column for filename for the current CSV import mechanism, when that column doesn't apply to any of the song metadata (and then you have to set something for that column for every song). I suppose if I supported populating metadata for new songs created through other import mechanisms, the filename could be specified as a column, but that would definitely require different parsing to handle that.
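For readers following along, the lookup order described above could be sketched in Python like this. It is only an illustration of the idea, not MobileSheets code: the function name is hypothetical, and treating the first line as a filename only when it ends in ".pdf" is an assumption made for the sketch.

```python
from pathlib import Path


def find_pdf_for_csv(csv_path):
    """Return the PDF that a CSV index refers to, or None.

    First look for a PDF with the same base name next to the CSV;
    failing that, treat the first line of the CSV as a PDF filename.
    """
    csv_path = Path(csv_path)
    candidate = csv_path.with_suffix(".pdf")
    if candidate.exists():
        return candidate
    with csv_path.open(encoding="utf-8") as handle:
        first_line = handle.readline().strip()
    # Assumption: only accept the first line if it looks like a PDF name.
    if first_line.lower().endswith(".pdf"):
        return csv_path.parent / first_line
    return None
```

For example, find_pdf_for_csv("NewReal1.csv") would return the path to NewReal1.pdf if that file sits in the same folder.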

sciurius · 01-21-2016, 06:50 PM

Yes, the proposal to add a "filename" column applies to bulk importing multiple files from a directory. It is not necessary for the current PDF/CSV import.
