Cleaning a SQL dump

Cleaning a SQL dump

1
NewbieNewbie
1

    Dec 31, 2006#1

    Hello,

    I have SQL dumps that must be "cleaned". Below is a sample of the text.

    Original text before edit:

    Code: Select all

    ', 0, -1, 0, 0);
    INSERT INTO `thread_id` VALUES(9, 'Bim Bousman', 'Hello', '2003-04-18 12:27:00', '<table border=0 cellpadding=2 cellspacing="0"><tr><td>
    <pre>Hello All,
    
    Just want said hello to everyone.
    
    Bim Bousman
    Honolulu
    ************************
    From: <<a href=forum/postID=ewrjw_-O7iw4x2sTK0kw8a1-OvSC8_P9qOb9_i80z0HJ4hjYAhmPsSusItNqv7EPSk9b0Sh-PLjtXhEVLViUO8qZnPpALg"></a>>
    To: <<a href="forum/postID=ewrjw_-O7iw4x2sTK0kw8a1-OvSC8_P9qOb9_i80z0HJ4hjYAhmPsSusItNqv7EPSk9b0Sh-PLjtXhEVLViUO8qZnPpALg"></a>>
    Sent: Thursday, April 17, 2003 9:50 PM
    Subject: Hello
    
    
    >
    >
    >
    > Your advertisement here.Click here for more information.
    >
    >
    ><br></pre>
    
    ', 0, -1, 0, 0);
    INSERT INTO `thread_id` VALUES(10, 'Ferromundo', 'Hello','2003-04-20 20:11:00', '<table border=0 cellpadding=2 cellspacing="0"><tr><td>
    <table bgColor="#ffffff"><tr><td>
    <div><font face="Arial" size="2">Welcome and enjoy your time.
    
    Thanks
    Ferromundo
    
    *********
    Avertisement
    here
    Click mor
    e details.
    *********
    ', 0, -1, 0, 0);
    INSERT INTO `thread_id` VALUES(11, 'Stanley', 'Hello','2003-04-21 16:22:00', '<table border=0 cellpadding=2 cellspacing="0"><tr><td>
    <pre>Hi Ferromundo,
    
    We miss you.When
    you can
    do war again.Call me after
    work hours.
    
    regards
    Stanley 
    =========
    Advertisement here
    Click more
    details.
    =========
    I want it like below after edit:

    Code: Select all

    ', 0, -1, 0, 0);
    INSERT INTO `thread_id` VALUES(9, 'Bim Bousman', 'Hello', '2003-04-18 12:27:00', '
    Hello All,
    
    Just want said hello to everyone.
    
    Bim Bousman
    Honolulu
    ', 0, -1, 0, 0);
    INSERT INTO `thread_id` VALUES(10, 'Ferromundo', 'Hello','2003-04-20 20:11:00', '
    Welcome and enjoy your time.
    
    Thanks
    Ferromundo
    ', 0, -1, 0, 0);
    INSERT INTO `thread_id` VALUES(11, 'Stanley', 'Hello','2003-04-21 16:22:00', '
    Hi Ferromundo,
    
    We miss you.When you can do war again.Call me after work hours.
    
    regards
    Stanley
    There are many sensitive characters in the file like ',; etc. because this is an SQL dump.
    So does anybody here know how to do that? I try find and replace one by one every character, but there are too many different word in file.

    Or if interested I am offering this job to UltraEdit developer or other members here who are experts on using UltraEdit.

    I have a lot of SQL dumps that must be "cleaned". So if this job is NOT FREE, I will pay for it.

    Regards,
    Ringo

    6,681583
    Grand MasterGrand Master
    6,681583

      Jan 02, 2007#2

      Here is a macro which does the job for your example. The macro is designed to run on a file with MS-DOS line terminations. If your SQL dump is in Unix format opened in Unix mode without temporary conversion to DOS (which is never a good setting on Windows) you must replace every ^p (means CRLF - carriage return and line-feed) with ^n (means LF only).

      First the macro trims all trailing spaces on the whole file to be able to work usefully with ^p.

      The first loop finds '<table border=0 with or without preceding space(s) and if found expands the selection to the position where after the character > a character is following which is not the character < or a line termination. Additionally the second find in first loop also selects 0 or more occurrences of spaces and tabs after last >. The selection is reduced by 1 character - the character after last > if there are no spaces/tabs following - and then deletes the selection. If now the current character is either a space or a tab, delete this character too. Because it is important and deleted before a space with a following ' and a line termination is inserted. So this loop should delete the HTML tags before the email content.

      The second loop finds 1 or more line terminations followed by at least 5 * or =. If such a string is found, it is first replaced by a single line termination, before everything to next SQL entry or end of file is also selected and then deleted. This loop should delete the advertisements in the emails.

      The third loop finds with an extremely strange but because of some bugs only working regular expression in UltraEdit style 3 successive lines where none of the 3 lines start with a '. Such a block is expanded to end of the paragraph without including the last character of the paragraph and then converted with ReturnToWrap to a single line. Don't know if this is really needed, but your example has such a modified email content line.

      The macro property Continue if a Find with Replace not found must be checked for this macro.

      Code: Select all

      InsertMode
      ColumnModeOff
      HexOff
      UnixReOff
      Top
      TrimTrailingSpaces
      Loop 
      Find RegExp " ++'<table border=0"
      IfNotFound
      ExitLoop
      EndIf
      StartSelect
      Find RegExp ">[~<^p][ ^t]++"
      Key LEFT ARROW
      Delete
      EndSelect
      IfCharIs 32
      Delete
      EndIf
      IfCharIs 9
      Delete
      EndIf
      " '
      "
      EndLoop
      Top
      Loop 
      Find RegExp "[^p]+^{^*^*^*^*^*^}^{=====^}"
      IfNotFound
      ExitLoop
      EndIf
      "
      "
      StartSelect
      Find "', 0, -1, 0, 0);"
      IfSel
      Key HOME
      Else
      SelectToBottom
      EndIf
      Delete
      EndLoop
      Top
      Loop 
      Find RegExp "%[~'^p][~^p]+^p[~'^p][~^p]+^p[~'^p]"
      IfNotFound
      ExitLoop
      EndIf
      StartSelect
      Find "^p^p"
      Key UP ARROW
      Key END
      Key LEFT ARROW
      ReturnToWrap
      EndSelect
      EndLoop
      Top
      
      Add UnixReOn or PerlReOn (v12+ of UE) at the end of the macro if you do not use UltraEdit style regular expressions by default - see search configuration. The macro command UnixReOff sets the regular expression option to UltraEdit style.
      Best regards from an UC/UE/UES for Windows user from Austria

      4
      NewbieNewbie
      4

        Mar 23, 2007#3

        A really quick and dirty way to strip the HTML markups from the text is to display the file in a browser, then "select all" the displayed text in the browser window (Ctrl-A), and then copy/paste the selection into an open and ready UltraEdit window.

        Usually this works for me. Your mileage may vary.

        "share and enjoy"