Read a CSV Until a New Line Is Found

Read CSV (RapidMiner Studio Core)

Synopsis

This Operator reads an ExampleSet from the specified CSV file.

Description

CSV is an abbreviation for Comma-Separated Values. CSV files store data (both numerical and text) in plain-text format. All values corresponding to an Example are stored as one line in the CSV file. Values for different Attributes are separated by a separator character. The separator remains constant: each row in the file uses the same separator for separating Attribute values. The term 'CSV' suggests that the Attribute values would be separated by commas, but other separators can also be used.
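The idea of one constant separator per file can be sketched outside RapidMiner. The snippet below is a minimal Python illustration using the standard csv module; it is not how the Operator is implemented, and the function name read_rows and the sample strings are made up for this example.

```python
import csv
import io

def read_rows(text, separator=","):
    """Split each line of `text` into values using one constant separator."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    # csv.reader yields an empty list for blank lines; skip those.
    return [row for row in reader if row]

# The same table, once comma-separated and once semicolon-separated.
comma_rows = read_rows("id,name\n1,alice\n")
semicolon_rows = read_rows("id;name\n1;alice\n", separator=";")
```

Both calls produce the same rows, which is the point of the last sentence above: the separator is a setting of the file, not part of the data.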

The easiest way to import a CSV file is to use the Import Configuration Wizard from the Parameters panel. All parameters can also be set directly in the Parameters panel. For more details about the Operator, see the description of the parameters.

Please make sure that the CSV file is read correctly as an ExampleSet before building a Process that uses it.

Differentiation

There are many Read <source> Operators in the Data Access group and Files/Read sub-group. For example, there are Read Excel, Read URL, Read SPSS, Read XML and other Operators, which can read an ExampleSet from different file formats.

Input

  • file (File)

    A CSV file can optionally be passed in as a file object. This can be created with Operators having file output ports, such as the Read File Operator.

Output

  • output (Data Table)

    This port delivers the ExampleSet created from the CSV file provided at the input port, imported through the Import Configuration Wizard or loaded from the path given to the csv file parameter.

Parameters

  • Import_Configuration_Wizard

    This user-friendly wizard guides you through configuring this Operator to import the CSV file.

    Range:
  • csv_file

    The path of the CSV file is specified here. It can also be selected using the 'Choose a file' button.

    Range:
  • column_separators

    Column separators for CSV files can be specified here. The separator can also be provided as a regular expression. A good understanding of regular expressions can be developed by studying the description of the Select Attributes Operator and its tutorial Processes.

    Range:
  • trim_lines

    This parameter indicates if lines should be trimmed (removal of empty spaces at the beginning and the end) before the column split is performed. This option might be problematic if TABs ('\t') are used as separators.

    Range:
  • use_quotes

    This parameter indicates if quotes should be regarded. Quotes can be used to store special characters like column separators. For example, if (,) is set as column separator and (") is set as quotes character, then a row (a,b,c,d) will be translated as 4 values for 4 columns. On the other hand, ("a,b,c,d") will be translated as a single column value a,b,c,d. If this parameter is set to false, the quotes character parameter and the escape character parameter cannot be defined.

    Range:
  • quotes_character

    This parameter defines the quotes character and is only available if use quotes is set to true.

    Range:
  • escape_character

    This parameter specifies the character used to escape the quotes and is only available if use quotes is set to true. For example, if (") is used as quotes character and ('\') is used as escape character, then ("yes") will be translated as (yes) and (\"yes\") will be translated as ("yes").

    Range:
  • skip_comments

    This parameter is used to ignore comments in the CSV file (if any). If this option is set to true, a comment character should be defined using the comment characters parameter.

    Range:
  • comment_characters

    This parameter is available if skip comments is set to true. Lines starting with these characters are ignored. If this character is present in the middle of the line, anything that comes in that line after this character is ignored. The comment character itself is also ignored.

    Range:
  • parse_numbers

    This parameter specifies whether numbers are parsed or not.

    Range:
  • decimal_character

    This character is used as the decimal character.

    Range:
  • grouped_digits

    This parameter decides whether grouped digits should be parsed or not. If this parameter is set to true, the grouping character parameter has to be specified.

    Range:
  • grouping_character

    This character is used as the grouping character. If this character is found between numbers, the numbers are combined and this character is ignored. For example, if "22-14" is present in the CSV file and "-" is set as the grouping character, then "2214" will be stored.

    Range:
  • infinity_string

    This parameter can be set to parse a specific infinity representation (e.g. "Infinity"). If it is not set, the locale-specific infinity representation will be used.

    Range: string
  • date_format

    This parameter specifies the date and time format. Many predefined options exist, but users can also specify a new format. If text in a CSV file column matches this date format, that column is automatically converted to date type.

    Some corrections are automatically made on invalid date values. For example, a value '32-March' will automatically be converted to '1-April'.

    Columns containing values which cannot be interpreted as numbers will be interpreted as nominal, as long as they do not match the date and time pattern of the date format parameter. If they match, this column of the CSV file will be automatically parsed as date and the corresponding Attribute will be of type date.

    Range:
  • first_row_as_names

    If this parameter is set to true, it is assumed that the first line of the CSV file has the names of the Attributes. If so, the Attributes are automatically named and the first line of the CSV file is not treated as a data line.

    Range:
  • annotations

    If the first row as names parameter is not set to true, annotations can be added using the 'Edit List' button of this parameter, which opens a new menu. This menu allows you to select any row and assign an annotation to it. Name, Comment and Unit annotations can be assigned. If row 0 is assigned a Name annotation, it is equivalent to setting the first row as names parameter to true. If you want to ignore any row, you can annotate it as Comment. Remember that the row number in this menu does not count commented lines.

    Range:
  • time_zone

    Users can select any time zone from the list of provided time zones.

    Range:
  • locale

    Users can select any locale from the list of provided locales.

    Range:
  • encoding

    Users can select any encoding from the list of provided encodings.

    Range:
  • read_all_values_as_polynominal

    This option allows you to disable the type handling for this Operator. Every column will be read as a polynominal Attribute.

    Range:
  • data_set_meta_data_information

    This parameter allows you to adjust or override the meta data of the CSV file. Column index, name, type and role can be specified here.

    The Read CSV Operator automatically tries to determine an appropriate data type for the Attributes by reading the first few lines and checking the occurring values. Integer values are assigned the integer data type, real values the real data type. Values which cannot be interpreted as numbers are assigned the nominal data type, as long as they do not match the format of the date format parameter.

    With the data set meta data information parameter, this automatic assignment can be adjusted or overwritten.

    Range:
  • read_not_matching_values_as_missings

    If this parameter is set to true, values that do not match the expected value type are considered as missing values and are replaced by '?'. For example, if 'abc' is written in an integer column, it will be treated as a missing value. A question mark (?) in the CSV file is also read as a missing value.

    Range:
  • data_management

    This parameter determines how the data is represented internally. Users can select any option from the provided list.

    Range:
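The number-related parameters above (decimal character, grouped digits, grouping character, read not matching values as missings) can be approximated in a few lines. This is a hedged Python sketch, not the Operator's actual parsing code: parse_number is a hypothetical helper, and NaN stands in for the '?' missing value.

```python
import math

def parse_number(text, decimal_character=".", grouping_character=None):
    """Return `text` as a float, or NaN when it does not look like a number."""
    cleaned = text.strip()
    if grouping_character:
        # Grouping characters are dropped and the digits are joined.
        # Note: using "-" for grouping would also remove a leading minus sign.
        cleaned = cleaned.replace(grouping_character, "")
    if decimal_character != ".":
        # Normalise the decimal character to the one float() expects.
        cleaned = cleaned.replace(decimal_character, ".")
    try:
        return float(cleaned)
    except ValueError:
        # Assumption: a non-matching value becomes missing (NaN here).
        return math.nan

grouped = parse_number("22-14", grouping_character="-")
european = parse_number("1.234,5", decimal_character=",", grouping_character=".")
missing = parse_number("xyz")
```

The grouping character is removed before the decimal character is normalised; doing it the other way round would break inputs such as "1.234,5", where "." is the grouping character.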

Tutorial Processes

Read a CSV file

(Optional) Save the following text in a text file:

att1,att2,att3,att4 # row 1

80.6, yes , 1996.JAN.21 ,22-14 # row 2

12.43,"yes",1997.MAR.30,23-22 # row 3

13.5,\"no\",1998.AUG.22,23-14 # row 4

23.3,yes,1876.JAN.32,42-65# row 5

21.6,yes,2001.JUL.12,xyz # row 6

12.56,",_?",2002.SEP.18,15-90# row 7

This is a sample CSV file.
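To see how comment skipping, line trimming, quotes and the escape character interact on such a file, the following Python sketch mimics those settings with the standard csv module. It is an approximation under stated assumptions, not the Read CSV Operator itself: read_sample is a hypothetical helper, a subset of the sample rows is repeated inside the block so it runs on its own, and stripping each individual value is an assumption made to match the tutorial's description of row 2.

```python
import csv

# Rows 1, 2, 3, 4 and 7 of the sample file.
SAMPLE = r'''att1,att2,att3,att4 # row 1
80.6, yes , 1996.JAN.21 ,22-14 # row 2
12.43,"yes",1997.MAR.30,23-22 # row 3
13.5,\"no\",1998.AUG.22,23-14 # row 4
12.56,",_?",2002.SEP.18,15-90# row 7
'''

def read_sample(text, comment_character="#"):
    # Drop everything from the comment character to the end of each line,
    # then trim the line before it is split into columns.
    lines = [line.split(comment_character, 1)[0].strip()
             for line in text.splitlines()]
    reader = csv.reader(
        [line for line in lines if line],
        delimiter=",", quotechar='"', escapechar="\\",
    )
    # Assumption: surrounding white space of each value is removed as well.
    header, *rows = [[value.strip() for value in row] for row in reader]
    return header, rows

header, rows = read_sample(SAMPLE)
```

With these settings the quoted value in row 3 loses its quotes, the escaped quotes in row 4 are kept as literal characters, and the quoted value in row 7 keeps its comma instead of being split into two columns.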

(Optional) You can load the sample file with the given tutorial Process by providing its path in the csv file parameter or by using the 'Choose a file' button.

Run the Process and compare the results in the Results view with the CSV file. The Process performs the following actions:

'#' is defined as a comment character, so 'row {number}' is ignored in all rows. As the first row as names parameter is set to true, att1, att2, att3 and att4 are set as Attribute names. The Attribute att1 is set as real, att2 as polynominal, att3 as date and att4 as real. For Attribute att4, the '-' character is ignored in all rows because the grouped digits parameter is set to true and '-' is specified as the grouping character. In row 2, the white spaces at the start and end of values are ignored because the trim lines parameter is set to true. In row 3, quotes are not ignored because use quotes is set to true; the content inside the quotes is taken as the value for Attribute att2. In row 4, (\"no\") is taken as (no) in quotes, because the escape character is set to '\'. In row 5, the date value is automatically corrected from 'JAN.32' to 'Feb.1'. In row 6, an invalid real value for the Attribute att4 is replaced by '?' because the read not matching values as missings parameter is set to true. In row 7, quotes are used to retrieve special characters as values, including the column separator (,) and a question mark.
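The date correction described for row 5 can be illustrated with a small Python sketch. It assumes the tutorial's dates follow a year.MONTH.day layout (the actual date format setting is not stated here), and parse_lenient_date is a hypothetical helper rather than anything RapidMiner exposes. An out-of-range day is rolled over by counting days from the first of the month.

```python
from datetime import date, timedelta

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def parse_lenient_date(text):
    """Parse 'yyyy.MMM.dd', rolling an overflowing day into the next month."""
    year, month_name, day = text.strip().split(".")
    month = MONTHS.index(month_name.upper()) + 1
    # Start at day 1 and add the remaining days, so day 32 of a 31-day
    # month lands on the first day of the following month.
    return date(int(year), month, 1) + timedelta(days=int(day) - 1)

corrected = parse_lenient_date("1876.JAN.32")
regular = parse_lenient_date("1996.JAN.21")
```

Here corrected is 1 February 1876, matching the 'JAN.32' to 'Feb.1' correction in the walkthrough, while a valid date such as 1996.JAN.21 is returned unchanged.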

Source: https://docs.rapidminer.com/latest/studio/operators/data_access/files/read/read_csv.html
