To evaluate the accuracy of hysterosalpingography (HSG) compared with hysteroscopy and/or laparoscopy, and to assess intra- and inter-observer variability.
Two hundred infertile women underwent hysterosalpingography, hysteroscopy and/or laparoscopy as part of an infertility work-up. HSG examinations were retrospectively reviewed by three radiologists, and their readings were compared to assess inter-observer variability. Each radiologist read the same examinations again after three months, and the two readings were compared to calculate intra-observer variability.
The final HSG diagnosis was compared with hysteroscopy and/or laparoscopy. The overall sensitivity, specificity, PPV, NPV and accuracy of each HSG diagnosis were assessed.
Intra-observer reliability was variable: observer 1 (k = 0.21), observer 2 (k = 0.57), and observer 3 (k = 0.65). The highest agreement was seen in the detection of a normal uterus, normal tubes and uterine filling defects; the lowest agreement was seen in the detection of uterine and pelvic adhesions.
First-round results showed moderate agreement between the three pairs of radiologists (k = 0.42-0.53). Second-round results showed substantial agreement for observer 1 (k = 0.62), while agreement between radiologists 2 and 3 was moderate (k = 0.44).
With the consensus diagnosis of all readers combined, the overall accuracy of HSG in diagnosing tubal pathology and uterine cavitary lesions was 93% and 85%, respectively. The lowest accuracy was seen for uterine adhesions (71%).
HSG is more accurate in tubal evaluation than in uterine cavity assessment. HSG interpretation is somewhat subjective: although experience and training may improve reporting skills and interpretation results, considerable observer variability exists. The gynecologist should carefully interpret HSG results and base further management on comprehensive clinical and radiological data.
Hysterosalpingogram, Hysteroscopy, Laparoscopy, Infertility, Inter-observer variability, Tubal obstruction, Pelvic adhesions
Hysterosalpingography (HSG) is one of the most commonly used imaging modalities in the evaluation of an infertile female. It is still the gold standard in the evaluation of tubal patency and can also evaluate uterine cavity abnormalities. Tubal damage represents the most important cause of infertility; it may have an intrinsic cause such as salpingitis or an extrinsic cause such as pelvic surgery. In a meta-analysis by Swart, et al., the sensitivity and specificity of HSG in evaluating tubal patency were 65% and 83%, respectively. HSG performance in peritubal adhesions was found to be below 50% and was considered unreliable.
Uterine cavity abnormalities are found in 10% of subfertile women and in 50% of women with recurrent implantation failure.
Intrauterine filling defects detected by HSG have multiple differential diagnoses, including polyps, submucosal fibroids, endometrial hyperplasia, and Asherman's syndrome. The sensitivity of HSG in detecting uterine cavity abnormalities ranged from 60% to 98%, while specificity ranged from 15% to 80%.
Identifying tubal damage or a uterine cavity abnormality on HSG as the cause of infertility helps the gynecologist decide which procedure the patient will undergo (laparoscopy, hysteroscopy or surgery). Although laparoscopy is superior to HSG in the evaluation of pelvic pathology and peritoneal factors of infertility, HSG is more economical and less invasive, and the two diagnostic methods are complementary.
The reading of HSG films is crucial, because the interpretation determines which additional surgical procedures will be needed. Unfortunately, these interpretations may be affected by inter- and intra-observer variability in reading.
A previous study that included only clinicians described poor to fair inter-observer reliability of HSG, and only two studies in the literature were designed to assess observer variability among both clinicians and radiologists in HSG interpretation [1,7]. Okaya, et al. reported more inter-observer variability among clinicians than among radiologists and concluded that radiologists were more compatible than clinicians, in contrast to Renbaum, et al., who described low inter-observer agreement between readers with generally good inter-observer reliability.
High reproducibility is essential for a clinical test to achieve high diagnostic accuracy, and both the diagnostic accuracy and the reproducibility of HSG may be affected by local circumstances, such as reader experience. To the best of our knowledge, HSG reproducibility exclusively among radiologists has not been studied, which raises the question of whether observer bias is responsible for variation in the observed prevalence of uterine and tubal abnormalities in infertile females. We conducted this study in our own setting to clarify the accuracy, reproducibility, and repeatability of HSG in diagnosing tubal pathology and uterine cavitary lesions.
The Ethics Committee of the Faculty of Medicine of our University approved this retrospective study, which was conducted at a University Hospital Center (Divisions of Reproductive Endocrinology and Radiology); informed consent was not required. We searched medical records on the PACS (picture archiving and communication system) over a 2-year period. Two hundred infertile females who underwent hysterosalpingography (HSG) and hysteroscopy and/or laparoscopy as part of an infertility work-up were included in the study. Patients with a previous pelvic operation, malignancy, or ectopic pregnancy were excluded. The mean interval between HSG and laparoscopy was 3 ± 1.1 months, and between HSG and hysteroscopy was 2 ± 0.5 months.
Standard HSG examinations of good quality (including first and second films) were included in the study. They were evaluated by three radiologists who were each involved in HSG reading on a weekly basis but had different levels of experience: two years, five years, and ten years.
Each reader interpreted 10 HSG examinations per session, and each session lasted 30 minutes.
All radiologists were blinded to the patient's identity and to the diagnoses given by the other readers. Each HSG examination was evaluated as follows. The uterus was rated as within normal limits, congenital anomaly, filling defects, or uterine adhesions. The tubes were rated similarly as within normal limits, unilateral or bilateral hydrosalpinx, unilateral or bilateral cornual obstruction, unilateral or bilateral distal obstruction, or pelvic adhesions.
The results of the first reading and of a second reading three months later were recorded to calculate inter- and intra-observer variability.
Finally, the three radiologists interpreted all HSG examinations in consensus to reach a final diagnosis, which was compared with the gold standard (hysteroscopy and/or laparoscopy).
IBM SPSS Statistics version 21 (IBM Corp., Armonk, NY) was used for data analysis. Inter-observer and intra-observer agreement were tested for the presence of uterine cavity or tubal abnormalities and for the type of abnormality (as stated in the methodology section). Cohen's kappa coefficient was used to calculate inter-observer and intra-observer agreement. A k-value of 0.81-1.00 indicates excellent agreement; 0.61-0.80, good agreement; 0.41-0.60, moderate agreement; 0.21-0.40, fair agreement; and < 0.20, poor agreement. The consistency of diagnosis between readers was estimated using the intraclass correlation coefficient (ICC).
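As an illustration of how this agreement statistic works, the following Python sketch computes Cohen's kappa for two readers and maps the value to the agreement bands listed above. The study analysis was performed in SPSS, so this is not the authors' code: the function names and the ten toy ratings are hypothetical, and assigning a value of exactly 0.20 to the poor band is an assumption.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two readers who rated the same cases (categorical labels)."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both readers must rate the same non-empty set of cases")
    n = len(ratings_a)
    # Observed agreement: fraction of cases given the same label by both readers.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both readers used one identical label for every case
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Map kappa to the bands used in this paper (0.20 treated as poor: assumption)."""
    if kappa > 0.80:
        return "excellent"
    if kappa > 0.60:
        return "good"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    return "poor"


# Hypothetical toy ratings of ten uterine readings; NOT data from this study.
reader_1 = ["normal", "normal", "filling defect", "adhesions", "normal",
            "anomaly", "normal", "filling defect", "normal", "adhesions"]
reader_2 = ["normal", "normal", "filling defect", "normal", "normal",
            "anomaly", "filling defect", "filling defect", "normal", "normal"]

kappa = cohen_kappa(reader_1, reader_2)
print(f"kappa = {kappa:.2f} ({agreement_band(kappa)} agreement)")
```

In this toy example the two readers agree on 7 of 10 cases, but the expected chance agreement is 0.37, so kappa is about 0.52 (moderate) rather than 0.70.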
A result was considered true positive if the HSG diagnosis was confirmed by hysteroscopy or laparoscopy; otherwise, it was considered false positive. A result was considered true negative if no abnormality was detected by HSG and this was confirmed by hysteroscopy and/or laparoscopy; otherwise, it was considered false negative. The sensitivity, specificity, positive and negative predictive values, and accuracy of HSG were calculated. A p-value ≤ 0.05 was considered statistically significant.
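The definitions above translate directly into the usual 2 × 2 table formulas. The sketch below is a minimal Python illustration of those formulas; the function name and the counts are hypothetical and are not taken from this study.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Accuracy measures of an index test (HSG) against a reference standard."""

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a row or column of the table is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # abnormal on reference, detected by HSG
        "specificity": ratio(tn, tn + fp),  # normal on reference, normal on HSG
        "ppv": ratio(tp, tp + fp),          # abnormal HSG findings that were confirmed
        "npv": ratio(tn, tn + fn),          # normal HSG findings that were confirmed
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
    }


# Hypothetical counts for 100 patients; NOT data from this study.
metrics = diagnostic_metrics(tp=30, fp=6, fn=10, tn=54)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Returning NaN for an empty denominator is a design choice for this sketch, so that a diagnosis with no reference-positive cases does not stop the calculation for the remaining diagnoses.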
Two hundred women were analyzed in this study: 123 (61.5%) had primary infertility and 77 (38.5%) had secondary infertility. The mean age was 30.36 ± 3.79 years.
Among the 200 women, hysteroscopy and HSG showed a normal uterine cavity in 72 (36%) and 68 (34%), respectively, and an abnormal cavity in 128 (64%) and 132 (66%), respectively.
Among 128 women, laparoscopy and HSG showed normal tubes in 89 (69.5%) and 86 (67.2%), respectively, and abnormal tubes in 39 (30.5%) and 42 (32.8%), respectively.
Laparoscopy detected pelvic adhesions in 35.2% (31/128) and peritubal adhesions in 18% (23/128), while HSG diagnosed pelvic adhesions in 46.5% (41/128) and peritubal adhesions in 15% (19/128).
HSG, hysteroscopic and laparoscopic findings are summarized in Figure 1.
Figure 1: Summarized HSG, hysteroscopy and laparoscopic findings.
The sensitivity, specificity, positive and negative predictive values, and accuracy of HSG in the diagnosis of uterine cavitary lesions and tubal pathology are summarized in Table 1.
Table 1: Overall accuracy of HSG diagnosis compared to hysteroscopy and/or laparoscopy (gold standard)†.
Inter-observer agreement was variable: it was fair between observers 1 and 2 and between observers 1 and 3, while moderate agreement was seen between observers 2 and 3. Agreement between radiologists (inter-observer variability) for the first- and second-round readings is summarized in Table 2.
Table 2: Agreement between pairs of radiologists (inter-observer variability) for first- and second-round reading (P value < 0.05)†.
Inter-observer agreement was lowest in the detection of pelvic, uterine and peritubal adhesions (Figure 2). It was highest for the detection of a normal uterine contour and uterine filling defects (Figure 3 and Figure 4). Agreement between the three radiologists combined for each diagnosis is provided in Table 3.
Figure 2: A 26-year-old female, G3 P0 A3, with uterine and pelvic adhesions on hysterosalpingography. (a) The first film showed a relatively reduced size of the uterine cavity with hazy outlines (arrow in a) and immediate spill of contrast from both tubes. (b) The second film showed loculation of the contrast in the central part of the pelvis (arrow in b). Only the expert radiologist correctly diagnosed the case; blinding to the patient's clinical history of three previous dilation and curettage procedures for three recurrent miscarriages before 20 weeks may have contributed to the misdiagnosis by the other two radiologists. Results were confirmed by laparoscopy.
Figure 3: A 31-year-old nulliparous female with a fundal fibroid. (a) The first hysterosalpingography film showed the uterine cavity displaced downward and to the right, with a filling defect at the fundus (arrow in a), non-visualization of both tubes and no immediate spill of contrast. All radiologists agreed on the diagnosis of fundal fibroid with bilateral tubal block. (b) Abdominal ultrasound image of the same patient showing a bulky uterus with a large hypoechoic fundal fibroid (circular lines in b) that displaces the endometrial line downward (arrow in b).
Figure 4: A 29-year-old female, G1 P1 A0, with multiple fibroids. (a) The first hysterosalpingography film showed a relatively enlarged uterine cavity with distorted outlines and immediate spill of contrast from both tubes. Two radiologists agreed on the possible diagnosis of multiple fibroids. (b) Transvaginal ultrasound image showing an enlarged uterus with multiple hypoechoic mural and submucous fibroids, the largest measuring 4 × 3 cm (arrow in b), which displaces the endometrial line (dashed arrow in b). (c) Hysteroscopic image confirming the fibroids.
Table 3: Agreement of the three radiologists combined in each diagnosis.
The second-round reading showed an increase in the overall kappa value between observers 1 and 2 and between observers 1 and 3, while the moderate agreement between observers 2 and 3 was unchanged (Table 2). The improvement was mainly seen in the detection of uterine congenital anomalies (ICC = 0.93) and of hydrosalpinx (ICC = 0.60).
After the specified time period, substantial agreement was achieved by observer 1, while observers 2 and 3 showed moderate agreement (Table 1).
Intra-observer agreement was variable: it was poor in the detection of pelvic adhesions, moderate for the detection of hydrosalpinx and tubal obstruction, and highest for the detection of a normal uterine contour and normal tubes. Agreement between the two readings for each radiologist (intra-observer variability) is summarized in Table 4.
Table 4: Agreement between the two readings for each radiologist (intra-observer variability) (P value < 0.01)†.
HSG is an important diagnostic modality for uterine cavity and tubal evaluation in infertile females, and subsequent management is greatly influenced by its results.
We found that the overall accuracy of HSG in diagnosing uterine cavity pathology was 85%. Previous studies reported a wide range of sensitivity and specificity: sensitivity ranged from 21% to 81% and specificity from 70% to 98%. Low sensitivity was reported by Taskin, et al., who attributed it to the predominance of male factor infertility among couples attending their infertility clinic. Nigam, et al. reported a 70% PPV of HSG for detecting intrauterine lesions. Shakya reported an HSG accuracy of 90%. Clever, et al. found no significant difference between HSG and hysteroscopy in the detection of uterine adhesions. Vahdat, et al. reported an HSG accuracy of 84.8% in the diagnosis of uterine malformations. Acholonu, et al. reported a 50.3% overall accuracy of hysterosalpingography in uterine cavity evaluation and concluded that HSG will remain an important screening test in infertile females, although it is less reliable than sonohysterography. We think that these differences may be attributed to differences in sample size and in the prevalence of each pathology, as well as to differences in HSG technique at different points of the menstrual cycle and to variable reporting methods, since the interpretation of most uterine cavity abnormalities is somewhat subjective.
We found 93% accuracy of HSG in the diagnosis of tubal pathology; similar results were reported by previous studies [16,17]. Lavy, et al. concluded that in 95% of patients with a normal HSG or a suspected unilateral tubal obstruction, laparoscopy is not indicated and patient management will not be altered. However, in patients with suspected bilateral tubal pathology, laparoscopy is mandatory and may alter patient therapy. We think the better accuracy of HSG in tubal assessment than in uterine cavity assessment may be explained by the fact that tubal evaluation is a straightforward clinical entity with well-defined end points (such as the state of tubal filling and the presence or absence of hydrosalpinx), while uterine cavity assessment is more prone to subjectivity in interpretation and diagnosis.
In agreement with a previous study, we found that the accuracy of HSG in the diagnosis of peritubal adhesions was 84%.
The lowest HSG accuracy was seen in the diagnosis of pelvic adhesions. This is consistent with a previous study, which concluded that HSG is less accurate than laparoscopy in the diagnosis of pelvic adhesions and that laparoscopy and HSG are complementary. In cases with severe pelvic disease, HSG showed a high PPV in the detection of peritoneal factors of infertility; however, because of the low NPV of HSG, patients with a suspicious or even a normal HSG should undergo diagnostic laparoscopy.
HSG can be performed and interpreted by radiologists and is not restricted to gynecologists. Inter- and intra-observer variability among radiologists is unknown.
The kappa coefficient is influenced by disease prevalence. Percentage agreement may compensate for this disadvantage, and this explains the discrepancy between a high agreement percentage and low k values in low-prevalence abnormalities. The clinical significance of a kappa value depends on its context, and values cannot always be compared between studies because they depend on disease prevalence.
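The dependence of kappa on prevalence can be shown with a small numerical sketch. The Python example below compares two hypothetical pairs of readers who both agree on 90 of 100 cases: in one table the abnormality is common, in the other it is rare. The function name and all counts are illustrative assumptions, not data from this study.

```python
def agreement_stats(both_pos, only_a, only_b, both_neg):
    """Percentage agreement and Cohen's kappa for two readers, binary finding."""
    n = both_pos + only_a + only_b + both_neg
    observed = (both_pos + both_neg) / n
    # Marginal rates of a positive call for reader A and reader B.
    a_pos = (both_pos + only_a) / n
    b_pos = (both_pos + only_b) / n
    # Agreement expected by chance from the marginal rates alone.
    expected = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    if expected == 1.0:
        return observed, 1.0  # all cases in one cell; kappa is otherwise undefined
    return observed, (observed - expected) / (1 - expected)


# Hypothetical tables of 100 cases each; NOT data from this study.
common = agreement_stats(both_pos=45, only_a=5, only_b=5, both_neg=45)
rare = agreement_stats(both_pos=2, only_a=5, only_b=5, both_neg=88)

print(f"common finding: agreement {common[0]:.0%}, kappa {common[1]:.2f}")
print(f"rare finding:   agreement {rare[0]:.0%}, kappa {rare[1]:.2f}")
```

Both tables show 90% agreement, yet kappa is 0.80 for the common finding and about 0.23 for the rare one, because chance agreement on "absent" is already about 87% when the finding is rare.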
The current study showed fair to moderate inter-observer agreement between all three readers, with improved kappa values for the least experienced reader in the second-round reading. This unsatisfactory result agrees with a previous study. Low agreement was also obtained by Glatstein, et al., who found marginal agreement between experienced clinicians in HSG interpretation and concluded that clinicians should review both the original HSG films and the radiologist's report for a better management plan. In a recent study, Okaya, et al. found more inter-observer variability among clinicians than among radiologists, concluded that radiologists were more compatible than clinicians, and recommended better-designed studies to confirm observer variability and to determine who is better placed to read HSGs.
We found the highest inter-observer agreement in the detection of a normal uterine contour and uterine filling defects, which agrees with previous studies [1,7]. This may be because physicians are more likely to agree on diagnoses with well-defined end points.
Glatstein, et al. described the lowest agreement in the detection of pelvic adhesions, and we obtained similar results. Renbaum, et al., in contrast, described the highest agreement between radiologists in the diagnosis of uterine adhesions. Our low agreement may be attributed to the small number of cases in this category and to the low accuracy of HSG in this area, which was reported in previous literature with receiver operating characteristic curve estimates of sensitivity ranging from 0.0 to 0.83 and specificity from 0.5 to 0.99.
Our study found moderate agreement in the detection of hydrosalpinx. Similar results have been reported [1,5], in contrast to Renbaum, et al., who described less agreement among radiologists than among clinicians in the diagnosis of hydrosalpinx.
We found an improvement in inter-observer agreement for the least experienced radiologist, with stable moderate agreement for the other two radiologists; this improvement occurred in the diagnosis of hydrosalpinx and congenital malformation. Good inter-observer reliability between clinicians and radiologists, especially in the detection of normal findings, was described in a previous study.
Our study has limitations. The first is the number of readers: we chose three readers because previous literature reported that using more than three observers hardly improves the reliability of agreement studies based on the k value. Another limitation is that the readers were blinded to the patients' clinical data. The radiologists were given series of static HSG images and did not view the examination in real time, so they could not observe dynamic filling and spillage. However, because this limitation applied to all readers, it should not affect inter-reader variability; moreover, most clinicians now prefer not to be present during HSG and read the films afterwards to make clinical decisions. The second reading in our study might have been affected by a training effect and detection bias as a consequence of the first reading; to reduce this, we ensured a gap of at least 3 months between the two readings. The interval between HSG and hysteroscopy or laparoscopy may also have contributed to bias. We encourage future research to develop a guideline with exact definitions of what should be judged as an HSG abnormality, as well as a randomized controlled trial to evaluate observer variability in HSG and its impact on patient management.
In conclusion, HSG is more accurate in tubal evaluation than in uterine cavity assessment. HSG interpretation is somewhat subjective: although experience and training may improve reporting skills and interpretation results, a considerable level of diagnostic discrepancy exists. The introduction of a standard systematic approach to HSG interpretation may improve reproducibility. The gynecologist should carefully interpret HSG results and base further management on comprehensive clinical and radiological data.
HSG is more accurate in tubal evaluation than uterine cavity assessment.
HSG interpretation is somewhat subjective; although experience and training may improve reporting skills and interpretation results, considerable observer variability may exist.
The gynecologist should carefully interpret HSG results and provide future management based on comprehensive clinical and radiological data.
The authors have no conflicts of interest.
No funding was received for this work from any organization.
The study was approved by the institutional review board of our university; informed consent was waived.
Study type: single-institute study, diagnostic (retrospective cohort), Level of Evidence 3a.