Importing an unusually formatted text file and converting to CSV

Question

I have some data that looks like:

--------------------------------------
data point 300
--------------------------------------
Empty DataFrame
Columns: []
Index: []
--------------------------------------
data point 301
--------------------------------------
               participant_role_category                   participant_title 
                     Managers/Bookrunner           Joint Lead Managers-Books             
--------------------------------------
data point 302
--------------------------------------
               participant_role_category                   participant_title 
                     Lead                                      Co-manager(s)
--------------------------------------
data point 303
--------------------------------------
               participant_role_category                   participant_title 
                     Lead                                      Agent(s)
                     Co-manager                                Manager(s)

Basically, I have a table, then a separator block of dashed lines around a label such as "data point 101", then another table, and so on for 40,000 instances.

Can I import this into Mathematica, turn it into a table, and export it as a CSV in some way, so that I can extract, say, data point 303 and then the data from its table?

The answer, of course, is "yes". The most general way would be to import the file as a string (or as lines) and then parse it. Build functions to handle the workflow: split on the main dividers, split on the subdividers, handle malformed data, and so on. I suggest you give that a try yourself and then come back with specific questions when you get stuck.
– lericr, Commented May 10, 2023 at 19:27
OK, so basically split the string manually; that makes sense. I thought there might be an option to autodetect a table format.
– apg, Commented May 10, 2023 at 21:15
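
A minimal sketch of the split-first approach suggested in the comment above. It is not taken from either answer. It assumes that every block starts with a "data point N" line, that dashed lines carry no data, that a real table opens with the participant_role_category header, and that columns are separated by two or more spaces. The names parseBlocks, parseBlock and sample are hypothetical.

```wolfram
(* Hypothetical helper: text -> <|number -> list of {role, title} rows|> *)
parseBlocks[text_String] := Module[{lines, blocks},
  lines = StringTrim /@ StringSplit[text, "\n"];
  (* drop blank lines and dashed dividers *)
  lines = Select[lines, # =!= "" && ! StringMatchQ[#, "-" ..] &];
  (* start a new block at every "data point N" line *)
  blocks = Split[lines, ! StringStartsQ[#2, "data point "] &];
  (* ignore anything that precedes the first "data point" line *)
  blocks = Select[blocks, StringStartsQ[First[#], "data point "] &];
  Association[parseBlock /@ blocks]
]

(* Hypothetical helper: one block -> number -> rows *)
parseBlock[{head_String, body___String}] :=
  FromDigits[StringDelete[head, "data point "]] ->
    If[{body} =!= {} &&
       StringStartsQ[First[{body}], "participant_role_category"],
      (* assumption: fields never contain two consecutive spaces *)
      StringSplit[#, RegularExpression[" {2,}"]] & /@ Rest[{body}],
      {}  (* no table, e.g. "Empty DataFrame" *)
    ]

sample = "--------\ndata point 303\n--------\n   participant_role_category     participant_title\n   Lead        Agent(s)\n   Co-manager     Manager(s)";
parseBlocks[sample][303]
(* {{"Lead", "Agent(s)"}, {"Co-manager", "Manager(s)"}} *)
```

Blocks without a table are kept with an empty list, so a lookup such as parseBlocks[sample][300] can tell "no rows" apart from "no such data point".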

anon · Accepted Answer · 2023-05-11 01:09:18Z

Here is an approximation:

in = StringToStream[
"--------------------------------------
data point 300
--------------------------------------
Empty DataFrame
Columns: []
Index: []
--------------------------------------
data point 301
--------------------------------------
               participant_role_category                   participant_title 
                     Managers/Bookrunner           Joint Lead Managers-Books             
--------------------------------------
data point 302
--------------------------------------
               participant_role_category                   participant_title 
                     Lead                                      Co-manager(s)
--------------------------------------
data point 303
--------------------------------------
               participant_role_category                   participant_title 
                     Lead                                      Agent(s)
                     Co-manager                                Manager(s)"];
    
file = StringSplit[ReadString[in], RegularExpression["[\r\n]+"]]
     
Close[in]

InputForm[file]

getNumber = Interpreter["Number"]

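Dataset[
  Merge[
    Flatten[
      look = False;  (* True while we are inside a table body *)
      ({line = StringTrim[#1];
        Which[
          (* a dashed divider ends the current table *)
          StringMatchQ[line, RegularExpression["^-+$"]],
            look = False; Nothing,
          (* "data point N": remember N as the key for the rows that follow *)
          StringStartsQ[line, "data point "],
            key = getNumber[StringSplit[line][[-1]]]; Nothing,
          (* the column header starts a table body *)
          StringStartsQ[line, "participant_role_category"],
            look = True; Nothing,
          (* a table row: split on wide runs of spaces, keeping single spaces inside a field *)
          look,
            Association[key -> StringTrim /@
              StringSplit[line, RegularExpression["      *"]]],
          (* anything else, e.g. "Empty DataFrame", is ignored *)
          True,
            Nothing]} &) /@ file],
    Identity]]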

Dataset[<|301 -> {{"Managers/Bookrunner", 
      "Joint Lead Managers-Books"}}, 
   302 -> {{"Lead", "Co-manager(s)"}}, 
   303 -> {{"Lead", "Agent(s)"}, 
     {"Co-manager", "Manager(s)"}}|>]

Now, it is your turn! "Salt to taste!"

If you do not like the Dataset wrapper, then remove it. I used the questioner's data text verbatim and thereby missed the final carriage return and line feed. In some cases, I had to make formatting decisions (e.g., whether keys were numeric and how to handle keys with multiple sets of data). Finally, I deliberately did not provide the final unwrap to a matrix made for easy Export to CSV, because of the multiple sets of data: multiple rows or multiple columns.
– anon, Commented May 11, 2023 at 18:50
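
One possible way to do that final unwrap, as a sketch only: choose the "multiple rows" layout and repeat the data point number on every row. The association below copies the shape of the output shown above (Normal applied to the Dataset gives such an association). The names tables, rows and header and the column title data_point are made up for illustration.

```wolfram
(* association in the shape of the answer's output *)
tables = <|301 -> {{"Managers/Bookrunner", "Joint Lead Managers-Books"}},
   302 -> {{"Lead", "Co-manager(s)"}},
   303 -> {{"Lead", "Agent(s)"}, {"Co-manager", "Manager(s)"}}|>;

(* one CSV row per table row, with the key prepended to each row *)
rows = Join @@ KeyValueMap[Function[{k, v}, Prepend[#, k] & /@ v], tables]
(* {{301, "Managers/Bookrunner", "Joint Lead Managers-Books"},
    {302, "Lead", "Co-manager(s)"},
    {303, "Lead", "Agent(s)"},
    {303, "Co-manager", "Manager(s)"}} *)

(* hypothetical column titles *)
header = {"data_point", "participant_role_category", "participant_title"};

(* ExportString returns the CSV text; Export with a file name writes it to disk *)
ExportString[Prepend[rows, header], "CSV"]
```

Every row then has the same three fields, which keeps the CSV rectangular even when a data point has several table rows.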

Daniel Huber · Accepted Answer · 2023-05-11 10:34:10Z

For an example, I copied your data into a file. Note that the last line should terminate with a CR/LF. Then we can read the data by:

dat = ReadString["d:/tmp/dat.txt"];

Next follows a pretty ugly dissection of the data structure to get the data into a form that is suitable for the construction of an association:

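(* keep only the blocks that contain a table: label line, dashed line, then indented lines *)
tmp = StringCases[dat, 
   RegularExpression["data point (\\d+).*\\n--+.\\n(  .+\\n)+"]];
(* the data point number of each kept block *)
num = StringCases[#, 
      RegularExpression["data point (\\d+)"] :> 
       ToExpression["$1"]] & /@ tmp // Flatten;
(* delete the column header, the label line and the dashed line *)
tmp = StringDelete[tmp, 
   RegularExpression["(\\s+participant.*\\n)|(data.*\\n)|(--+.*\\n)"]];
(* first pass: split each block into its trimmed rows *)
tmp = Map[
   StringCases[#, RegularExpression["(.+)"] :> StringTrim@"$1"] &, 
   tmp, {-1}];
(* second pass: wrap each row in its own list, so that it can become a {role, title} pair *)
tmp = Map[
   StringCases[#, RegularExpression["(.+)"] :> StringTrim@"$1"] &, 
   tmp, {-1}];
(* fixed-width split of each trimmed row at character 26: role first, title after *)
tmp = Map[Sequence @@ StringTake[#, {{1, 26}, {26, -1}}] &, tmp, {-1}];
tmp = Map[StringTrim, tmp, {-1}];
(* data point number -> list of rows *)
assoc = AssociationThread[num, tmp]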
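
The StringTake step above splits every row at a fixed character position, so it depends on the exact column widths in the file. As a hedged alternative, assuming no field ever contains two consecutive spaces, a row can be split on runs of spaces instead. The helper name splitRow is hypothetical.

```wolfram
(* Hypothetical helper: split one table row on runs of two or more spaces *)
splitRow[row_String] :=
  StringSplit[StringTrim[row], RegularExpression[" {2,}"]]

splitRow["   Managers/Bookrunner           Joint Lead Managers-Books   "]
(* {"Managers/Bookrunner", "Joint Lead Managers-Books"} *)
```

Single spaces inside a field, as in "Joint Lead Managers-Books", are left alone because the pattern only matches two or more spaces.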

With assoc we may now e.g. write:

assoc[303]

{{"Lead", "Agent(s)"}, {"Co-manager", "Manager(s)"}}

Or:

assoc[301]

{{"Managers/Bookrunner", "Joint Lead Managers-Books"}}

To export it as a CSV list, you may write:

Export["d:/tmp/dat.csv", 
 Map[Sequence, Flatten /@ Transpose[{num, tmp}], 2]]
